
CN119248617A - Elasticsearch cluster fault intelligent detection system - Google Patents

Elasticsearch cluster fault intelligent detection system

Info

Publication number
CN119248617A
CN119248617A CN202411575774.7A CN202411575774A CN119248617A CN 119248617 A CN119248617 A CN 119248617A CN 202411575774 A CN202411575774 A CN 202411575774A CN 119248617 A CN119248617 A CN 119248617A
Authority
CN
China
Prior art keywords
fault
data
module
model
fault type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411575774.7A
Other languages
Chinese (zh)
Inventor
杨昌玉
原帅
刘衍琦
彭洪浩
吴雪安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yantai Institute Of Technology
Original Assignee
Yantai Institute Of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yantai Institute Of Technology
Priority to CN202411575774.7A
Publication of CN119248617A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/3476 Data logging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Probability & Statistics with Applications (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides an Elasticsearch cluster fault intelligent detection system in the technical field of big data storage and retrieval. The system comprises a data acquisition module, an Elasticsearch cluster fault analysis module, an expert system module, and a model optimization and self-learning module. The fault detection model of the Elasticsearch cluster fault analysis module analyzes the current log data and monitoring index data of the Elasticsearch cluster and determines the predicted fault type of the cluster, realizing automatic detection and prediction. Fusing a deep learning algorithm with an expert rule base allows complex and dynamic fault scenarios to be handled efficiently, reduces the false-alarm and missed-detection rates, and improves detection accuracy. The model optimization and self-learning module provides self-learning and dynamic optimization of the fault detection model, reducing operation and maintenance complexity and cost.

Description

Elasticsearch cluster fault intelligent detection system
Technical Field
The invention relates to the technical field of big data storage and retrieval, and in particular to an Elasticsearch cluster fault intelligent detection system.
Background
Elasticsearch is a distributed, real-time, high-performance search and analytics engine that can process large volumes of data and return fast, accurate search results. In practical applications, an Elasticsearch cluster may encounter various faults, such as node downtime, full disks, and network failures. Detecting Elasticsearch cluster faults is therefore crucial.
In the prior art, fault detection for Elasticsearch clusters relies mainly on traditional monitoring tools and manual troubleshooting, and generally covers the following aspects:
(1) Performance monitoring and log analysis: the health of the cluster is evaluated by monitoring performance metrics of the Elasticsearch cluster (such as central processing unit (CPU) utilization, memory utilization, and query latency), while a log analysis tool analyzes the Elasticsearch logs to detect abnormal behavior or error messages;
(2) Visualization tool application: Kibana, Grafana, or other visualization tools display the performance metrics of the cluster, and a separate log analysis tool (e.g., Logstash) is deployed to process and analyze log data;
(3) Real-time monitoring and history recording: the running state of the Elasticsearch cluster is monitored in real time so that anomalies and faults in the cluster are found and handled promptly. A history of cluster operating conditions is also kept to assist troubleshooting and performance optimization.
However, the prior art relies on static monitoring and manual, experience-based analysis. It cannot efficiently handle complex and dynamic fault scenarios, has difficulty predicting and preventing potential faults, is inefficient, and incurs high operation and maintenance complexity and cost.
Disclosure of Invention
To address the problems in the prior art, embodiments of the invention provide an Elasticsearch cluster fault intelligent detection system.
The invention provides an Elasticsearch cluster fault intelligent detection system comprising a data acquisition module, an Elasticsearch cluster fault analysis module, an expert system module, and a model optimization and self-learning module, wherein:
the data acquisition module is used for acquiring current log data and monitoring index data of the Elasticsearch cluster;
the Elasticsearch cluster fault analysis module is used for extracting feature vectors from the current log data and the monitoring index data using a natural language processing (NLP) algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution over fault types output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, wherein the fault detection model is a deep-learning-based large-scale pre-trained language model trained on a training data set that includes historical log data and monitoring index data labeled as normal, historical log data and monitoring index data labeled as faulty, and the associated fault types;
the expert system module is used for verifying the predicted fault type against the current log data, the monitoring index data, and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, wherein the expert rule base comprises a plurality of rules, each rule comprising a fault type, an Elasticsearch cluster fault symptom, and a solution;
the model optimization and self-learning module is used for training the fault detection model according to the current log data, the monitoring index data, the predicted fault type, and the target fault type.
Optionally, the expert system module is specifically configured to:
match and verify the current log data, the monitoring index data, and the predicted fault type against the rules of the expert rule base to obtain a verification result;
when the verification result is that a target rule exists in the expert rule base whose fault type and Elasticsearch cluster fault symptom match the predicted fault type, the current log data, and the monitoring index data, determine that the target fault type of the Elasticsearch cluster is the predicted fault type;
when the verification result is that no target rule matching the current log data, the monitoring index data, and the predicted fault type exists in the expert rule base, look up in the expert rule base the Elasticsearch cluster fault symptom corresponding to the current log data and the monitoring index data, and determine the fault type associated with that fault symptom as the target fault type of the Elasticsearch cluster.
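The two verification branches can be illustrated with a minimal sketch. The `Rule` structure, the predicate-style symptom, the example thresholds, and the `None` result when no rule matches at all are illustrative assumptions; the patent does not specify how symptoms are encoded or what happens when nothing matches.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Rule:
    """One expert rule: fault type, symptom, and solution (hypothetical encoding)."""
    fault_type: str
    # Assumption: a symptom is a predicate over the current metrics and log lines.
    symptom: Callable[[Dict[str, float], List[str]], bool]
    solution: str


def verify(predicted: str, metrics: Dict[str, float], logs: List[str],
           rules: List[Rule]) -> Optional[str]:
    """Return the target fault type given the model's predicted fault type."""
    # Branch 1: a rule with the predicted fault type also matches the symptoms,
    # so the prediction is confirmed.
    for rule in rules:
        if rule.fault_type == predicted and rule.symptom(metrics, logs):
            return predicted
    # Branch 2: no such rule; fall back to the fault type of a rule whose
    # symptom matches the current data.
    for rule in rules:
        if rule.symptom(metrics, logs):
            return rule.fault_type
    # Not covered by the source text: no rule matches at all.
    return None


# Hypothetical rules and thresholds, for illustration only.
RULES = [
    Rule("cpu_overload", lambda m, l: m.get("cpu_utilization", 0.0) > 0.9,
         "Reduce query load or add nodes"),
    Rule("memory_overflow", lambda m, l: any("OutOfMemoryError" in x for x in l),
         "Increase heap size or reduce shard count"),
]

confirmed = verify("cpu_overload", {"cpu_utilization": 0.95}, [], RULES)
corrected = verify("cpu_overload", {"cpu_utilization": 0.20},
                   ["java.lang.OutOfMemoryError: Java heap space"], RULES)
```

In this sketch `confirmed` keeps the model's prediction, while `corrected` replaces it with the fault type whose symptom the data actually exhibits.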
Optionally, the fault detection model includes an input layer, a convolution layer, a pooling layer, a fully connected layer, and a Softmax layer;
the input layer receives the feature vector; the convolution layer extracts local features from the feature vector through convolution kernels to identify each fault type; the pooling layer downsamples the local features extracted by the convolution layer to obtain key features; the fully connected layer maps the key features produced by the pooling layer to the fault types to obtain a feature mapping result; and the Softmax layer generates the probability distribution over the fault types from the feature mapping result output by the fully connected layer.
Optionally, the data acquisition module is further configured to:
collect the historical log data, the monitoring index data, the various fault types, and the solutions, and send them to the model optimization and self-learning module.
Optionally, the model optimization and self-learning module is further configured to:
label the historical log data and monitoring index data with the corresponding fault types according to the Elasticsearch cluster fault symptoms they exhibit; for the historical log data and monitoring index data labeled as faulty, record the occurrence time, trigger conditions, operation steps, and the degree of impact of the fault on the Elasticsearch cluster and the service; and generate the training data set.
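As a rough illustration of what one entry of such a training data set might look like, the sketch below uses a hypothetical `TrainingRecord` structure and a hypothetical substring-based symptom lookup; the field names and the matching strategy are assumptions, not part of the patent text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrainingRecord:
    """One labeled training example (all field names are illustrative)."""
    logs: List[str]
    metrics: Dict[str, float]
    label: str                      # "normal" or a fault type
    occurred_at: str = ""           # occurrence time of the fault
    trigger_condition: str = ""     # what triggered the fault
    operation_steps: List[str] = field(default_factory=list)
    impact: str = ""                # degree of impact on the cluster and service


def label_record(logs: List[str], metrics: Dict[str, float],
                 symptom_to_fault: Dict[str, str]) -> TrainingRecord:
    """Label a sample by the first known symptom string found in its logs."""
    for symptom, fault in symptom_to_fault.items():
        if any(symptom in line for line in logs):
            return TrainingRecord(logs, metrics, fault)
    return TrainingRecord(logs, metrics, "normal")


# Hypothetical symptom table and samples.
SYMPTOMS = {"OutOfMemoryError": "memory_overflow",
            "master not discovered": "master_node_failure"}

faulty = label_record(["java.lang.OutOfMemoryError: Java heap space"],
                      {"heap_used": 0.99}, SYMPTOMS)
faulty.occurred_at = "2024-11-06T10:15:30"
faulty.impact = "queries rejected on one data node"
normal = label_record(["started"], {"heap_used": 0.40}, SYMPTOMS)
```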
Optionally, the model optimization and self-learning module is further configured to:
periodically retrain and update the fault detection model according to newly added historical log data and monitoring index data.
Optionally, the model optimization and self-learning module is further configured to:
retrain the fault detection model according to updated rules and newly added rules in the expert rule base.
Optionally, the model optimization and self-learning module is further configured to:
measure the difference between the predicted fault type and the target fault type using a cross-entropy loss function, and update the parameters of the fault detection model.
Optionally, the fault type includes at least one of:
index fragmentation, query delay, CPU overload, memory overflow, master node failure, data node failure, and network partition failure.
Optionally, the system further comprises:
a system monitoring and feedback module for monitoring the running conditions of the data acquisition module, the Elasticsearch cluster fault analysis module, the expert system module, and the model optimization and self-learning module.
The invention also provides an Elasticsearch cluster fault intelligent detection method, comprising:
collecting current log data and monitoring index data of an Elasticsearch cluster;
extracting feature vectors from the current log data and the monitoring index data using a natural language processing (NLP) algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution over fault types output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, wherein the fault detection model is a deep-learning-based large-scale pre-trained language model trained on a training data set that includes historical log data and monitoring index data labeled as normal, historical log data and monitoring index data labeled as faulty, and the associated fault types;
verifying the predicted fault type against the current log data, the monitoring index data, and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, wherein the expert rule base comprises a plurality of rules, each rule comprising a fault type, an Elasticsearch cluster fault symptom, and a solution;
training the fault detection model according to the current log data, the monitoring index data, the predicted fault type, and the target fault type.
The invention also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the Elasticsearch cluster fault intelligent detection method described above when executing the program.
The invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the Elasticsearch cluster fault intelligent detection method described above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the Elasticsearch cluster fault intelligent detection method described above.
The Elasticsearch cluster fault intelligent detection system provided by the invention comprises a data acquisition module, an Elasticsearch cluster fault analysis module, an expert system module, and a model optimization and self-learning module. The fault detection model in the Elasticsearch cluster fault analysis module performs in-depth analysis of the current log data and monitoring index data of the Elasticsearch cluster and determines the predicted fault type of the cluster, thereby predicting potential fault risks. This realizes automatic detection and prediction of Elasticsearch cluster faults, reduces dependence on manual experience, improves detection efficiency, and helps operation and maintenance personnel act in advance to avoid faults. Complex and dynamic fault scenarios can be handled efficiently, the false-alarm and missed-detection rates are reduced, and detection accuracy is improved. Through the model optimization and self-learning module, the fault detection model learns continuously and is optimized dynamically, so that it maintains a high level of fault detection capability and operation and maintenance cost is reduced.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of the framework of the Elasticsearch cluster fault intelligent detection system provided by the invention;
FIG. 2 is a schematic diagram of the module relationships of the Elasticsearch cluster fault intelligent detection system provided by the invention;
FIG. 3 is a schematic flow chart of the Elasticsearch cluster fault intelligent detection method provided by the invention;
FIG. 4 is a schematic structural diagram of an electronic device provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order to facilitate a clearer understanding of various embodiments of the present application, some relevant knowledge will be presented first.
Currently, fault detection for Elasticsearch clusters relies primarily on traditional monitoring tools and manual troubleshooting. These methods typically include (1) performance monitoring and log analysis, (2) visualization tool application, and (3) real-time monitoring and history recording.
These methods provide real-time performance monitoring and anomaly response, allow the cluster state to be inspected and analyzed visually when combined with visualization tools, scale well, and can process large volumes of logs and monitoring data.
However, these methods also have disadvantages:
(1) Reliance on manual analysis, with low efficiency:
Complex fault scenarios are analyzed and investigated on the basis of manual experience, which is inefficient and prone to misjudgment;
(2) Inability to predict and prevent potential faults:
Traditional monitoring methods mainly target faults that have already occurred, lack intelligent and automatic prediction capability, and have difficulty preventing potential fault risks in time;
(3) High operation and maintenance complexity:
Multiple independent monitoring and analysis tools must be configured and maintained, which increases system complexity and operation and maintenance cost;
(4) Processing capacity bottlenecks:
In large-scale clusters, performance monitoring and log analysis can hit bottlenecks in data volume and processing capacity, making it difficult to respond to and handle faults in time.
In view of the above problems, the invention provides an Elasticsearch cluster fault intelligent detection system for intelligent, automatic, and accurate detection of Elasticsearch cluster faults.
The Elasticsearch cluster fault intelligent detection system provided by the invention is described in detail below with reference to FIG. 1. FIG. 1 is a schematic diagram of the framework of the Elasticsearch cluster fault intelligent detection system provided by the invention. Referring to FIG. 1, the system includes:
a data acquisition module 101, an Elasticsearch cluster fault analysis module 102, an expert system module 103, and a model optimization and self-learning module 104, wherein:
The data acquisition module 101 is configured to acquire current log data and monitoring index data of the Elasticsearch cluster.
In the embodiment of the invention, the data acquisition module 101 is connected to the Elasticsearch cluster fault analysis module 102. It collects current log data and monitoring index data from the Elasticsearch cluster and passes them to the Elasticsearch cluster fault analysis module 102 for subsequent in-depth analysis and fault detection.
In practical applications, to provide a reliable data basis for the Elasticsearch cluster fault analysis module 102, the data acquisition module 101 needs to collect and organize the monitoring index data and log data of the Elasticsearch cluster in real time.
The monitoring index data comprise functional metrics and performance metrics, where the performance metrics include data such as CPU utilization, memory usage, and query latency. The log data include data such as error logs and warning logs.
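A minimal sketch of how one collected sample could be assembled is given below. It assumes log lines in the bracketed plain-text layout `[timestamp][LEVEL][logger] message`; the regular expression, the metric names, and the sample lines are illustrative assumptions rather than anything specified in the patent.

```python
import re
from typing import Dict, List, Optional

# Assumed layout: [timestamp][LEVEL ][logger] message
LOG_PATTERN = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<level>[A-Z]+)\s*\]\[(?P<logger>[^\]]+)\]\s*(?P<msg>.*)$"
)


def parse_log_line(line: str) -> Optional[Dict[str, str]]:
    """Split one log line into timestamp, level, logger, and message."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # e.g. a stack-trace continuation line
    entry = match.groupdict()
    entry["logger"] = entry["logger"].strip()
    return entry


def collect_sample(metrics: Dict[str, float], log_lines: List[str]) -> Dict[str, object]:
    """Bundle performance metrics with the error and warning log entries."""
    parsed = (parse_log_line(line) for line in log_lines)
    kept = [e for e in parsed if e is not None and e["level"] in ("ERROR", "WARN")]
    return {"metrics": dict(metrics), "logs": kept}


# Hypothetical input data.
SAMPLE_LINES = [
    "[2024-11-06T10:15:30,123][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][1234] overhead, spent [2.1s]",
    "[2024-11-06T10:15:31,000][INFO ][o.e.n.Node               ] [node-1] started",
]
sample = collect_sample(
    {"cpu_utilization": 0.93, "memory_usage": 0.81, "query_latency_ms": 420.0},
    SAMPLE_LINES,
)
```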
The Elasticsearch cluster fault analysis module 102 is configured to extract feature vectors from the current log data and the monitoring index data using a natural language processing (NLP) algorithm, input the feature vectors into a fault detection model, obtain the probability distribution over fault types output by the fault detection model, and determine the predicted fault type of the Elasticsearch cluster, where the fault detection model is a deep-learning-based large-scale pre-trained language model trained on a training data set that includes historical log data and monitoring index data labeled as normal, historical log data and monitoring index data labeled as faulty, and the associated fault types.
In the embodiment of the invention, the Elasticsearch cluster fault analysis module 102 implements intelligent detection and prediction of faults so as to reduce manual intervention. Based on the fault detection model, the module performs feature extraction and in-depth analysis on the monitoring index data and log data provided by the data acquisition module 101 and identifies potential faults and anomalies.
In practical applications, the fault detection model is a customized large-scale pre-trained language model, such as the Tongyi Qianwen (Qwen) model. The model is trained and optimized on data from the Elasticsearch cluster, including log data, monitoring index data, and historical fault cases, and is adapted to the log structure, monitoring index data, and common fault modes of the Elasticsearch cluster to ensure that it is effective and practical.
The Elasticsearch cluster fault analysis module 102 is connected to the expert system module 103 and passes the predicted fault type to the expert system module 103 for verification and optimization.
Optionally, the fault type includes at least one of:
index fragmentation, query delay, CPU overload, memory overflow, master node failure, data node failure, and network partition failure.
The expert system module 103 is configured to verify the predicted fault type against the current log data, the monitoring index data, and an expert rule base, and to determine the target fault type of the Elasticsearch cluster according to the verification result, where the expert rule base includes a plurality of rules and each rule includes a fault type, an Elasticsearch cluster fault symptom, and a solution.
In the embodiment of the invention, the expert system module 103 is introduced to improve the accuracy and practicality of fault detection. Based on the expert rule base, the expert system module 103 verifies and optimizes the predicted fault type output by the fault detection model, thereby determining the target fault type and improving reliability in complex fault scenarios. Each rule in the expert rule base defines a particular fault type, its fault symptom, and the corresponding solution. The rule base is deployed in the local environment and takes the particularities of the actual application scenario into account, which ensures that it is effective and applicable.
The module incorporates the operation and maintenance experience of domain experts, audits the detection results of the Elasticsearch cluster fault analysis module 102, provides optimization suggestions, and improves detection accuracy. It plays an important role in complex or novel fault scenarios.
It should be noted that the expert system module 103 may pass the optimized feedback information to the model optimization and self-learning module 104 so that the fault detection model can be updated.
The model optimization and self-learning module 104 is configured to train the fault detection model according to the current log data, the monitoring index data, the predicted fault type, and the target fault type.
To remain sensitive to the latest fault modes and keep detection accuracy high, the Elasticsearch cluster fault intelligent detection system is provided with the model optimization and self-learning module 104, which realizes continuous optimization and self-learning of the fault detection model.
The specific implementation steps are as follows:
(1) Continuous data collection and preprocessing
While the system is running, new log data and monitoring index data are collected continuously. The newly collected data are standardized and their features are extracted so that they meet the requirements of model training;
(2) Incremental learning and parameter optimization
The fault detection model is trained incrementally on the new data so that it learns new fault features and improves its ability to detect novel faults. Model parameters are adjusted dynamically according to the performance of the fault detection model on the new data, improving the generalization capability and robustness of the model;
(3) Expert feedback fusion
Feedback provided by the expert system module 103 and new fault rules are incorporated into the model training data set. Retraining corrects the model's misjudgments of specific fault types and improves detection accuracy;
(4) Model update and deployment
The optimized fault detection model is deployed to the Elasticsearch cluster fault analysis module 102, improving its detection capability. This forms a closed loop of data collection, model optimization, and model deployment that continuously improves system performance.
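The closed loop formed by steps (1) to (4) can be sketched as a small feedback buffer. The `FeedbackLoop` class, the batch-size trigger, and the `retrain_fn` callback are hypothetical: the patent does not state when retraining is triggered or how the updated model is deployed.

```python
from typing import Callable, List, Sequence, Tuple

# (feature vector, predicted fault type, target fault type)
Sample = Tuple[Sequence[float], str, str]


class FeedbackLoop:
    """Buffers expert-verified samples and triggers retraining in batches."""

    def __init__(self, retrain_fn: Callable[[List[Sample]], None], batch_size: int = 3):
        self.retrain_fn = retrain_fn    # assumed callback that retrains and redeploys
        self.batch_size = batch_size    # assumed trigger: retrain every N samples
        self.buffer: List[Sample] = []
        self.model_version = 0

    def record(self, features: Sequence[float], predicted: str, target: str) -> bool:
        """Store one verified sample; return True if a retraining round ran."""
        self.buffer.append((features, predicted, target))
        if len(self.buffer) < self.batch_size:
            return False
        batch, self.buffer = self.buffer, []
        self.retrain_fn(batch)          # steps (2) and (3): incremental learning with expert feedback
        self.model_version += 1         # step (4): the updated model replaces the old one
        return True


# Usage with a stand-in retraining function that only records its batches.
batches: List[List[Sample]] = []
loop = FeedbackLoop(batches.append, batch_size=2)
first = loop.record([0.1, 0.9], "cpu_overload", "cpu_overload")
second = loop.record([0.7, 0.2], "cpu_overload", "memory_overflow")
```

Keeping both the predicted and the target fault type in each sample lets the retraining step focus on cases where the expert system corrected the model.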
The model optimization and self-learning module 104 is described in detail in the following embodiments and is not described further here.
In the embodiment of the invention, the model optimization and self-learning module 104 is capable of continuous learning and dynamic adaptation, stays sensitive to new faults, and improves the long-term effectiveness of the system. By continuously collecting new data and expert feedback, the module performs incremental learning and dynamic optimization on the fault detection model, forming a closed loop. The modules of the Elasticsearch cluster fault intelligent detection system cooperate closely through data transmission and feedback mechanisms, ensuring that the fault detection and optimization workflow for the Elasticsearch cluster runs efficiently.
The Elasticsearch cluster fault intelligent detection system provided by the invention comprises a data acquisition module, an Elasticsearch cluster fault analysis module, an expert system module, and a model optimization and self-learning module. The fault detection model in the Elasticsearch cluster fault analysis module performs in-depth analysis of the current log data and monitoring index data of the Elasticsearch cluster and determines the predicted fault type of the cluster, thereby predicting potential fault risks. This realizes automatic detection and prediction of Elasticsearch cluster faults, reduces dependence on manual experience, improves detection efficiency, and helps operation and maintenance personnel act in advance to avoid faults. Fusing a deep learning algorithm with an expert rule base allows complex and dynamic fault scenarios to be handled efficiently, reduces the false-alarm and missed-detection rates, and improves detection accuracy. Through the model optimization and self-learning module, the fault detection model achieves self-learning and dynamic optimization, that is, continuous learning and dynamic adaptation, so that it maintains a high level of fault detection capability and operation and maintenance cost is reduced.
Optionally, the fault detection model includes an input layer, a convolution layer, a pooling layer, a fully connected layer, and a Softmax layer;
the input layer receives the feature vector; the convolution layer extracts local features from the feature vector through convolution kernels to identify each fault type; the pooling layer downsamples the local features extracted by the convolution layer to obtain key features; the fully connected layer maps the key features produced by the pooling layer to the fault types to obtain a feature mapping result; and the Softmax layer generates the probability distribution over the fault types from the feature mapping result output by the fully connected layer.
In the embodiment of the invention, the fault detection model adopts an architecture that combines a convolutional neural network (CNN) with a fully connected network. In practical applications, the fault detection model generates the probability distribution over fault types through the following steps:
(1) The input layer receives the feature vector. Before the monitoring index data and log data of the Elasticsearch cluster are fed to the input layer of the fault detection model, they are standardized and vectorized.
Specifically, for log data, a natural language processing model (e.g., a BERT model) converts the log data into a vector representation. The log data contain characteristic information about the various faults that occur, so the fault detection model can capture fault patterns hidden in the log data.
The monitoring index data (such as CPU utilization, memory usage, and query latency) need to be standardized and reduced in dimension so that key features can be extracted. The monitoring index data are closely related to the occurrence of faults, and by analyzing them the fault detection model can identify abnormal performance behavior.
(2) The convolution layer extracts local features in the feature vector through the convolution kernel, and identifies fault modes and abnormal behaviors in the system log.
(3) The pooling layer downsamples the output of the convolution layer, reduces the data dimension, and retains key features.
(4) The fully connected layer further processes the key features and maps them to the different fault types, such as primary node failure, data node failure, and network partition.
(5) The Softmax layer converts the feature mapping result output by the fully connected layer into a probability distribution over the fault types.
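Steps (2) to (5) can be illustrated with a minimal forward pass. The following is only a sketch under stated assumptions, not the model of the invention: the 16-dimensional feature vector, the four kernels of width 3, the ReLU activation, the pooling window of 2, and the random weights are all hypothetical, and the three fault types are taken from step (4).

```python
import math
import random

# Fault types named in step (4); the order is an arbitrary assumption.
FAULT_TYPES = ["primary node failure", "data node failure", "network partition"]

def conv1d(x, kernel, bias=0.0):
    """Valid 1-D convolution (cross-correlation) followed by ReLU (assumed activation)."""
    k = len(kernel)
    return [max(0.0, sum(x[i + j] * kernel[j] for j in range(k)) + bias)
            for i in range(len(x) - k + 1)]

def max_pool(x, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def dense(x, weights, biases):
    """Fully connected layer: one score (logit) per fault type."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def softmax(logits):
    """Numerically stable softmax that turns logits into probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(feature_vector, kernels, weights, biases):
    """Convolution -> pooling -> fully connected -> Softmax."""
    pooled = []
    for kernel in kernels:
        pooled.extend(max_pool(conv1d(feature_vector, kernel)))
    return softmax(dense(pooled, weights, biases))

rng = random.Random(0)
features = [rng.uniform(-1, 1) for _ in range(16)]                    # hypothetical input
kernels = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # untrained kernels
pooled_len = len(kernels) * ((16 - 3 + 1) // 2)                       # 4 kernels x 7 values
weights = [[rng.uniform(-0.5, 0.5) for _ in range(pooled_len)] for _ in FAULT_TYPES]
biases = [0.0] * len(FAULT_TYPES)

probs = predict(features, kernels, weights, biases)
predicted = FAULT_TYPES[probs.index(max(probs))]
```

The predicted fault type is simply the entry with the highest probability; with untrained random weights the result is meaningless and only demonstrates the data flow between the layers.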
This deep learning and prediction mechanism enables the fault detection model to accurately identify known faults and also gives it some ability to detect previously unseen faults.
To ensure the efficiency and accuracy of the fault detection model, it is necessary to continuously optimize it with the model optimization and self-learning module 104 during model training and application.
Specifically, the model optimization and self-learning module 104 is further configured to:
Measure the difference between the predicted fault type and the target fault type using a cross-entropy loss function, and update the parameters of the fault detection model accordingly.
The method specifically comprises the following steps:
1) Measuring the difference between the predicted result and the actual result by adopting a cross entropy loss function, and guiding the updating of model parameters;
2) Combining an Adam optimizer and a self-adaptive learning rate strategy, accelerating model convergence and avoiding overfitting;
3) As new fault data is collected, the model is periodically retrained and updated, maintaining sensitivity to the latest fault type.
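Items 1) and 2) can be sketched as one training step. This is a minimal illustration, assuming a single linear softmax layer in place of the full model; the hyperparameters, the feature values, and the function names are hypothetical. Adam adapts the step size per parameter; no separate learning-rate schedule is shown.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Loss between the predicted distribution and the target fault type."""
    return -math.log(max(probs[target_index], 1e-12))

class Adam:
    """Standard Adam update with bias-corrected moment estimates."""
    def __init__(self, n, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * n
        self.v = [0.0] * n
        self.t = 0

    def step(self, params, grads):
        self.t += 1
        for i, g in enumerate(grads):
            self.m[i] = self.b1 * self.m[i] + (1 - self.b1) * g
            self.v[i] = self.b2 * self.v[i] + (1 - self.b2) * g * g
            m_hat = self.m[i] / (1 - self.b1 ** self.t)
            v_hat = self.v[i] / (1 - self.b2 ** self.t)
            params[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)

def train_step(weights, optimizer, features, target_index, n_classes):
    """One update of a linear softmax classifier (stand-in for the full model)."""
    n = len(features)
    logits = [sum(weights[k * n + j] * features[j] for j in range(n))
              for k in range(n_classes)]
    probs = softmax(logits)
    loss = cross_entropy(probs, target_index)
    # Gradient of cross-entropy w.r.t. each weight: (p_k - y_k) * x_j
    grads = [(probs[k] - (1.0 if k == target_index else 0.0)) * features[j]
             for k in range(n_classes) for j in range(n)]
    optimizer.step(weights, grads)
    return loss

n_classes, features, target = 3, [0.8, -0.3], 1   # hypothetical sample and label
weights = [0.0] * (n_classes * len(features))
optimizer = Adam(len(weights))
losses = [train_step(weights, optimizer, features, target, n_classes) for _ in range(50)]
```

Here the label `target` plays the role of the target fault type confirmed by the expert system, and `losses` should shrink as the weights move toward it.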
Optionally, the expert system module 103 is specifically configured to:
Match and verify the current log data, the monitoring index data, and the predicted fault type against the rules of the expert rule base to obtain a verification result;
When the verification result is that a target rule exists in the expert rule base, and the fault type and the Elasticsearch cluster fault symptom of the target rule match the predicted fault type, the current log data, and the monitoring index data, determine that the target fault type of the Elasticsearch cluster is the predicted fault type;
When the verification result is that no target rule matching the current log data, the monitoring index data, and the predicted fault type exists in the expert rule base, identify the Elasticsearch cluster fault symptom in the expert rule base that corresponds to the current log data and the monitoring index data, and determine the fault type associated with that symptom in the expert rule base as the target fault type of the Elasticsearch cluster.
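The two verification outcomes can be sketched as follows. This is an illustrative assumption, not the rule engine of the invention: the `Rule` structure, the symptom predicates, the 90% CPU threshold, the metric key `cpu_utilization`, and the solution strings are all hypothetical. Returning `None` when no symptom matches is also an assumption, since the text does not specify that case.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Metrics = Dict[str, float]

@dataclass
class Rule:
    fault_type: str
    symptom: Callable[[Metrics, List[str]], bool]  # hypothetical symptom predicate
    solution: str

# Hypothetical rule base with two entries; thresholds are illustrative only.
RULES = [
    Rule("CPU overload",
         lambda m, logs: m.get("cpu_utilization", 0.0) > 90.0,
         "hypothetical: throttle heavy queries or add nodes"),
    Rule("memory overflow",
         lambda m, logs: any("OutOfMemoryError" in line for line in logs),
         "hypothetical: review heap settings"),
]

def determine_target_fault(predicted: str, metrics: Metrics, logs: List[str],
                           rules: List[Rule]) -> Optional[str]:
    # Case 1: a rule has the predicted fault type and its symptom matches the data.
    for rule in rules:
        if rule.fault_type == predicted and rule.symptom(metrics, logs):
            return predicted
    # Case 2: no such rule; fall back to the fault type of a rule whose symptom matches.
    for rule in rules:
        if rule.symptom(metrics, logs):
            return rule.fault_type
    return None  # assumption: behavior when no symptom matches is not specified

confirmed = determine_target_fault("CPU overload", {"cpu_utilization": 97.0}, [], RULES)
corrected = determine_target_fault(
    "network partition", {"cpu_utilization": 20.0},
    ["java.lang.OutOfMemoryError: Java heap space"], RULES)
unmatched = determine_target_fault("CPU overload", {"cpu_utilization": 10.0}, [], RULES)
```

In the second call the model's prediction is overridden by the rule base, which is the situation in which the predicted and target fault types differ and the model can later be corrected by retraining.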
In the embodiment of the invention, in order to improve the accuracy and practicability of fault detection, an expert system module 103 is introduced into the intelligent detection system for Elasticsearch cluster faults. The operation and maintenance experience accumulated over time by domain experts is systematized into an independent expert rule base, and the expert rule base and the Elasticsearch cluster fault analysis module 102 jointly participate in the fault detection decision process. The specific implementation flow is as follows:
(1) Expert rule base construction and localization deployment:
Summarizing and systemizing the operation and maintenance experience of the field expert, and establishing a rule base covering common fault types. Each rule defines a particular fault type, symptom, and corresponding solution. The rule base is deployed in a local environment, the specificity of the actual application scene is fully considered, and the effectiveness and applicability of the rule base are ensured.
(2) Real-time verification and optimization:
After the Elasticsearch cluster fault analysis module 102 outputs the predicted fault type, the expert system module 103 calls the expert rule base in real time to verify the predicted fault type, and confirms the fault type through rule matching. The expert rule base comprises a plurality of rules, and each rule comprises a fault type, an Elasticsearch cluster fault symptom, and a corresponding solution.
Case 1: the current log data, the monitoring index data, and the predicted fault type output by the fault detection model are matched and verified against the rules of the expert rule base. When the verification result is that a target rule exists in the expert rule base, and the fault type and the Elasticsearch cluster fault symptom of the target rule match the predicted fault type, the current log data, and the monitoring index data, the target fault type of the Elasticsearch cluster is determined to be the predicted fault type.
Case 2: when the verification result is that no target rule matching the current log data, the monitoring index data, and the predicted fault type exists in the expert rule base, the Elasticsearch cluster fault symptom in the expert rule base that corresponds to the current log data and the monitoring index data is identified, and the fault type associated with that symptom in the expert rule base is determined as the target fault type of the Elasticsearch cluster.
(3) Rule base maintenance and updating
While the system is running, whenever a new fault type is found or expert feedback is received, the rule base is updated and refined so that it remains current and effective.
Optionally, the model optimization and self-learning module 104 is further configured to:
Retrain the fault detection model according to the updated rules and newly added rules in the expert rule base. This keeps the fault detection model sensitive to the latest fault types.
Optionally, the data acquisition module 101 is further configured to:
The historical log data and monitoring index data, various fault types and solutions are collected and sent to the model optimization and self-learning module 104.
In the embodiment of the present invention, the data acquisition module 101 sends the historical log data, the monitoring index data, and the various fault types and solutions to the model optimization and self-learning module 104. The model optimization and self-learning module 104 generates a high-quality training data set from them, which is used to train the fault detection model, so that the model can learn the complex associations among different fault types and accurately identify and predict various complex faults.
In practical application, the training data set includes the historical log data and monitoring index data of the Elasticsearch cluster, the various fault types and solutions, and a plurality of sample rules from the expert rule base, where each sample rule includes a fault type, an Elasticsearch cluster fault symptom, and a corresponding solution.
Specifically, the failure types encompass index fragmentation, query latency, CPU overload, memory overflow, primary node failure, data node failure, network partitioning, and the like. The training data set not only contains the occurrence condition, symptom and influence of the fault, but also combines rich manual experience, and ensures the diversity and representativeness of the training data set.
In the data collection process, the monitoring index data (such as CPU utilization, memory occupation, query delay, etc.) and the log data (such as WARN and ERROR log entries) collected by the data collection module 101 provide accurate input features for the subsequent fault detection model through standardized processing, feature extraction and dimension reduction processing.
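The standardized processing and feature extraction mentioned above can be sketched as follows. This is a minimal illustration under assumptions: z-score scaling is one common choice of standardization that the text does not mandate, counting WARN and ERROR entries is only one possible log feature, the metric names and sample log lines are hypothetical, and dimension reduction is not shown.

```python
import re
from statistics import mean, pstdev

# Matches a bracketed log level such as "[WARN ]" or "[ERROR]" (assumed log layout).
LEVEL_PATTERN = re.compile(r"\[(WARN|ERROR)\s*\]")

def standardize(columns):
    """Z-score each metric column; a constant column maps to all zeros."""
    scaled = {}
    for name, values in columns.items():
        mu = mean(values)
        sigma = pstdev(values)
        scaled[name] = [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]
    return scaled

def count_log_levels(lines):
    """Count WARN and ERROR entries as simple log-derived features."""
    counts = {"WARN": 0, "ERROR": 0}
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical samples of the three metrics named in the text.
metrics = {
    "cpu_utilization": [35.0, 40.0, 96.0],
    "memory_usage": [61.0, 61.0, 61.0],
    "query_latency_ms": [12.0, 15.0, 480.0],
}
# Hypothetical log lines; logger names and messages are made up for illustration.
logs = [
    "[2024-11-06T10:00:01,120][WARN ][hypothetical.logger] [node-1] disk watermark exceeded",
    "[2024-11-06T10:00:02,431][ERROR][hypothetical.logger] [node-1] all shards failed",
    "[2024-11-06T10:00:03,002][INFO ][hypothetical.logger] [node-1] cluster health changed",
]

scaled = standardize(metrics)
level_counts = count_log_levels(logs)
```

After scaling, each metric has zero mean and unit variance, so a CPU spike and a latency spike become comparable in magnitude when they are combined into one feature vector.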
Optionally, the model optimization and self-learning module 104 is configured to collect the historical log data and monitoring index data of the Elasticsearch cluster together with the various fault types and solutions, and to label each item of historical log data and monitoring index data to obtain a training data set.
The marking of the history log data and the monitoring index data can be realized by the following steps:
Label the historical log data and monitoring index data with the corresponding fault types according to the Elasticsearch cluster fault symptoms to which they correspond, and, for each item of historical log data and monitoring index data labeled as a fault, record the occurrence time, the trigger conditions, the operation steps, and the degree of impact of the fault on the Elasticsearch cluster and the service, thereby generating the training data set.
In the labeling process, the historical log data and monitoring index data are first labeled with specific fault types according to their fault characteristics; the fault types include index fragmentation, query latency, CPU overload, memory overflow, primary node failure, data node failure, network partition, and the like. Then the occurrence time, trigger conditions, and operation steps of each fault sample, as well as the degree of impact of the fault on the cluster and the service, are recorded in detail to form labels describing the conditions and impact of the fault. To ensure consistent and accurate labeling, the initially labeled data is reviewed and corrected by domain experts, and questionable or disputed samples are resolved by consensus through expert discussion.
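One possible shape for a labeled sample is sketched below. The field names, the `expert_reviewed` flag, and the rule that only reviewed samples enter the training set are illustrative assumptions; the text specifies what is recorded, not how it is stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Fault types listed in the text.
FAULT_TYPES = {
    "index fragmentation", "query latency", "CPU overload", "memory overflow",
    "primary node failure", "data node failure", "network partition",
}

@dataclass
class LabeledSample:
    log_lines: List[str]
    metrics: Dict[str, float]
    label: str                                   # "normal" or one of FAULT_TYPES
    occurred_at: Optional[str] = None            # occurrence time of the fault
    trigger_condition: Optional[str] = None
    operation_steps: List[str] = field(default_factory=list)
    impact: Optional[str] = None                 # impact on cluster and service
    expert_reviewed: bool = False                # hypothetical review flag

    def __post_init__(self):
        if self.label != "normal" and self.label not in FAULT_TYPES:
            raise ValueError(f"unknown label: {self.label!r}")

def build_training_set(samples):
    """Keep only samples that passed expert review (an assumed policy)."""
    return [s for s in samples if s.expert_reviewed]

# Hypothetical samples for illustration.
samples = [
    LabeledSample(["hypothetical ERROR line"], {"cpu_utilization": 98.0},
                  label="CPU overload", occurred_at="2024-11-06T10:00:02",
                  trigger_condition="bulk indexing burst",
                  operation_steps=["paused bulk job"], impact="queries delayed",
                  expert_reviewed=True),
    LabeledSample([], {"cpu_utilization": 30.0}, label="normal", expert_reviewed=True),
    LabeledSample([], {"cpu_utilization": 55.0}, label="query latency"),
]
training_set = build_training_set(samples)
```

Validating the label at construction time catches typos in fault type names before they reach the training data.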
Optionally, the model optimization and self-learning module 104 is further configured to:
Periodically retrain and update the fault detection model according to newly added historical log data and monitoring index data.
In the intelligent detection system for Elasticsearch cluster faults provided by the embodiment of the invention, the fault detection model is tightly coupled with the model optimization and self-learning module 104 and therefore remains highly sensitive to new fault types. This realizes intelligent and automatic detection and prediction of Elasticsearch cluster faults and further improves the efficiency and accuracy of fault detection.
Fig. 2 is a schematic block diagram of the intelligent detection system for Elasticsearch cluster faults provided by the invention. In addition to the data acquisition module 101, the Elasticsearch cluster fault analysis module 102, the expert system module 103, and the model optimization and self-learning module 104 shown in Fig. 1, Fig. 2 further includes a system monitoring and feedback module 105, wherein:
The system monitoring and feedback module 105 is connected to the data acquisition module 101, the Elasticsearch cluster fault analysis module 102, the expert system module 103, and the model optimization and self-learning module 104, and is configured to monitor the operation of these four modules.
In the embodiment of the present invention, the system monitoring and feedback module 105 runs through the whole system. It monitors the running state and performance of each module, coordinates the data flow and the information flow, controls the behavior of the other modules through a feedback mechanism, and ensures that the system runs stably. The modules cooperate closely through the system monitoring and feedback module 105, which keeps the fault detection and optimization workflow of the Elasticsearch cluster running efficiently.
In summary, to address the problems of existing fault detection approaches, namely reliance on manual analysis, inability to predict potential faults, and high operation and maintenance complexity, the invention provides an intelligent detection system for Elasticsearch cluster faults based on data set construction and the fusion of a locally deployed intelligent model with expert experience. The system has the following advantages:
1. Improving the accuracy and efficiency of fault detection
(1) Intelligent and automatic. A customized large language model performs deep analysis on large volumes of log and monitoring index data, realizing automatic detection and prediction of faults, reducing dependence on manual experience, and greatly improving detection efficiency.
(2) Accurate identification of complex faults. By combining deep learning with expert experience, the model can accurately identify complex and dynamic fault scenarios, reduce the false-alarm and missed-detection rates, and improve detection accuracy.
2. Having fault prediction and prevention capabilities
(1) Potential faults are early warned in advance. The model can analyze historical data and real-time indexes, predicts potential fault risks, helps operation and maintenance personnel to take measures in advance, and avoids faults.
(2) Dynamically adapting to environmental changes. By self-learning and dynamic optimization of the model, the system can adapt to the changes of cluster configuration and load, and continuously maintain high-level detection capability.
3. Reducing operation and maintenance complexity and cost
(1) Functional integration. The system integrates data collection, intelligent analysis, the expert system, and other functions, which reduces dependence on multiple independent monitoring and analysis tools and lowers operation and maintenance complexity.
(2) Reduced manual intervention. The automatic detection and prediction mechanism significantly reduces the workload of manual troubleshooting, lowers operation and maintenance cost, and improves operation and maintenance efficiency.
4. Improving stability and expandability of system
(1) Efficient processing of large-scale data. Using an advanced deep learning algorithm, the system can efficiently process the log and performance data of large-scale clusters, overcoming the bottleneck of traditional methods in data volume and processing capacity.
(2) Modular design. The system has a clear structure and well-defined connections between modules; it is easy to deploy, maintain, and extend, and can adapt to growing service demands.
The intelligent detection method for Elasticsearch cluster faults provided by the invention is described below; the method described below and the system described above may be referred to in correspondence with each other. Fig. 3 is a schematic flow chart of the method. As shown in Fig. 3, the method is executed by the intelligent detection system for Elasticsearch cluster faults and specifically includes steps 301 to 304, where:
Step 301, collecting current log data and monitoring index data of an Elasticsearch cluster.
Step 302, extracting feature vectors of the current log data and monitoring index data using an NLP algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution of each fault type output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, wherein the fault detection model is a deep-learning-based large-scale pre-trained language model, and is obtained by training on the historical log data and monitoring index data labeled as normal, the historical log data and monitoring index data labeled as faults, and the associated fault types included in a training data set.
Step 303, verifying the predicted fault type according to the current log data, the monitoring index data, and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, wherein the expert rule base comprises a plurality of rules, and each rule comprises a fault type, an Elasticsearch cluster fault symptom, and a solution.
Step 304, performing model training on the fault detection model according to the current log data, the monitoring index data, the predicted fault type, and the target fault type.
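Steps 301 to 304 can be sketched as one detection cycle. This is a minimal illustration, assuming each step is supplied as a callable; the function and parameter names and all stub values below are hypothetical placeholders rather than parts of the described system.

```python
def run_detection_cycle(collect, extract_features, predict, verify, train):
    """One pass through steps 301 to 304 with injected step implementations."""
    logs, metrics = collect()                               # step 301
    features = extract_features(logs, metrics)              # step 302: feature vector
    probabilities = predict(features)                       # step 302: distribution
    predicted = max(probabilities, key=probabilities.get)   # step 302: predicted type
    target = verify(predicted, logs, metrics)               # step 303: rule-base check
    train(logs, metrics, predicted, target)                 # step 304: model training
    return predicted, target

training_calls = []  # records what would be handed to the training step

# Stub implementations with made-up data, standing in for the real modules.
predicted, target = run_detection_cycle(
    collect=lambda: (["[ERROR] hypothetical log line"], {"cpu_utilization": 97.0}),
    extract_features=lambda logs, metrics: [len(logs), metrics["cpu_utilization"] / 100.0],
    predict=lambda features: {"CPU overload": 0.7, "memory overflow": 0.2,
                              "network partition": 0.1},
    verify=lambda predicted, logs, metrics: (
        predicted if metrics["cpu_utilization"] > 90.0 else "memory overflow"),
    train=lambda *args: training_calls.append(args),
)
```

Passing the steps in as callables mirrors the modular structure of the system: each module can be replaced or tested independently while the order of the steps stays fixed.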
The intelligent detection method for Elasticsearch cluster faults reduces operation and maintenance complexity. The customized fault detection model performs deep analysis on the current log data and monitoring index data of the Elasticsearch cluster and determines the predicted fault type of the cluster, thereby predicting potential fault risks. This realizes automatic detection and prediction of Elasticsearch cluster faults, reduces dependence on manual experience, greatly improves detection efficiency, and helps operation and maintenance personnel take measures in advance to avoid faults. The method can also efficiently handle complex and dynamic fault scenarios, reduce the false-alarm and missed-detection rates, and improve detection accuracy. In addition, the fault detection model achieves self-learning and dynamic optimization, that is, continuous learning and dynamic adaptation, so that it continuously maintains a high level of fault detection capability while the complexity and cost of operation and maintenance are reduced.
Fig. 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 4, the electronic device may include a processor 410, a communications interface 420, a memory 430, and a communication bus 440, where the processor 410, the communications interface 420, and the memory 430 communicate with each other through the communication bus 440. The processor 410 may call logic instructions in the memory 430 to execute the intelligent detection method for Elasticsearch cluster faults, the method including: collecting current log data and monitoring index data of the Elasticsearch cluster; extracting feature vectors of the current log data and monitoring index data using an NLP algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution of each fault type output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, wherein the fault detection model is a deep-learning-based large-scale pre-trained language model obtained by training on the historical log data and monitoring index data labeled as normal, the historical log data and monitoring index data labeled as faults, and the associated fault types included in a training data set; verifying the predicted fault type according to the current log data, the monitoring index data, and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, wherein the expert rule base includes a plurality of rules, and each rule includes a fault type, an Elasticsearch cluster fault symptom, and a solution; and performing model training on the fault detection model according to the current log data, the monitoring index data, the predicted fault type, and the target fault type.
Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the intelligent detection method for Elasticsearch cluster faults provided by the methods above, the method including: collecting current log data and monitoring index data of the Elasticsearch cluster; extracting feature vectors of the current log data and monitoring index data using an NLP algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution of each fault type output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, wherein the fault detection model is a deep-learning-based large-scale pre-trained language model obtained by training on the historical log data and monitoring index data labeled as normal, the historical log data and monitoring index data labeled as faults, and the associated fault types included in a training data set; verifying the predicted fault type according to the current log data, the monitoring index data, and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, wherein the expert rule base includes a plurality of rules, and each rule includes a fault type, an Elasticsearch cluster fault symptom, and a solution; and performing model training on the fault detection model according to the current log data, the monitoring index data, the predicted fault type, and the target fault type.
In still another aspect, the invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the intelligent detection method for Elasticsearch cluster faults provided by the methods above, the method including: collecting current log data and monitoring index data of the Elasticsearch cluster; extracting feature vectors of the current log data and monitoring index data using an NLP algorithm, inputting the feature vectors into a fault detection model to obtain the probability distribution of each fault type output by the fault detection model, and determining the predicted fault type of the Elasticsearch cluster, wherein the fault detection model is a deep-learning-based large-scale pre-trained language model obtained by training on the historical log data and monitoring index data labeled as normal, the historical log data and monitoring index data labeled as faults, and the associated fault types included in a training data set; verifying the predicted fault type according to the current log data, the monitoring index data, and an expert rule base, and determining the target fault type of the Elasticsearch cluster according to the verification result, wherein the expert rule base includes a plurality of rules, and each rule includes a fault type, an Elasticsearch cluster fault symptom, and a solution; and performing model training on the fault detection model according to the current log data, the monitoring index data, the predicted fault type, and the target fault type.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.

Claims (10)

1.一种Elasticsearch集群故障智能化检测系统,其特征在于,包括数据采集模块、Elasticsearch集群故障分析模块、专家系统模块及模型优化与自我学习模块;其中:1. An Elasticsearch cluster fault intelligent detection system, characterized by comprising a data acquisition module, an Elasticsearch cluster fault analysis module, an expert system module and a model optimization and self-learning module; wherein: 所述数据采集模块,用于采集Elasticsearch集群的当前日志数据和监控指标数据;The data collection module is used to collect current log data and monitoring indicator data of the Elasticsearch cluster; 所述Elasticsearch集群故障分析模块,用于使用自然语言处理NLP算法提取所述当前日志数据和监控指标数据的特征向量,将所述特征向量输入至故障检测模型,得到所述故障检测模型输出的各故障类型的概率分布,确定所述Elasticsearch集群的预测故障类型;其中,所述故障检测模型为基于深度学习的大规模预训练语言模型,所述故障检测模型是基于训练数据集中包括的标注为正常的历史日志数据和监控指标数据、标注为故障的历史日志数据和监控指标数据及关联的故障类型进行训练得到的;The Elasticsearch cluster fault analysis module is used to extract feature vectors of the current log data and monitoring indicator data using a natural language processing (NLP) algorithm, input the feature vectors into a fault detection model, obtain the probability distribution of each fault type output by the fault detection model, and determine the predicted fault type of the Elasticsearch cluster; wherein the fault detection model is a large-scale pre-trained language model based on deep learning, and the fault detection model is trained based on historical log data and monitoring indicator data marked as normal, historical log data and monitoring indicator data marked as faults, and associated fault types included in a training data set; 所述专家系统模块,用于根据所述当前日志数据和监控指标数据及专家规则库对所述预测故障类型进行验证,根据验证结果确定所述Elasticsearch集群的目标故障类型;其中,所述专家规则库包括多个规则,每个规则包括故障类型、Elasticsearch集群故障症状及解决方案;The expert system module is used to verify the predicted fault type according to the current log data, monitoring indicator data and expert rule base, and determine the target fault type of the Elasticsearch cluster according to the verification result; wherein the expert rule base includes multiple rules, each rule includes a fault type, Elasticsearch cluster fault 
symptoms and solutions; the model optimization and self-learning module is used to train the fault detection model based on the current log data and monitoring indicator data, the predicted fault type and the target fault type.

2. The Elasticsearch cluster fault intelligent detection system according to claim 1, characterized in that the expert system module is specifically used for:
matching and verifying the current log data and monitoring indicator data and the predicted fault type against the rules of the expert rule base to obtain a verification result;
when the verification result is that a target rule exists in the expert rule base, and the fault type and Elasticsearch cluster fault symptoms of the target rule match the predicted fault type and the current log data and monitoring indicator data, determining that the target fault type of the Elasticsearch cluster is the predicted fault type;
when the verification result is that no target rule in the expert rule base matches the current log data and monitoring indicator data and the predicted fault type, determining, based on the Elasticsearch cluster fault symptoms in the expert rule base that correspond to the current log data and monitoring indicator data, the fault type associated with those fault symptoms in the expert rule base as the target fault type of the Elasticsearch cluster.

3. The Elasticsearch cluster fault intelligent detection system according to claim 1, characterized in that the fault detection model includes an input layer, a convolution layer, a pooling layer, a fully connected layer and a Softmax layer;
wherein the input layer receives the feature vector; the convolution layer uses convolution kernels to extract local features from the feature vector in order to identify each fault type; the pooling layer downsamples the local features extracted by the convolution layer to obtain key features; the fully connected layer maps the key features sampled by the pooling layer to each fault type to obtain a feature mapping result; and the Softmax layer generates a probability distribution over the fault types from the feature mapping result output by the fully connected layer.

4. The Elasticsearch cluster fault intelligent detection system according to claim 1, characterized in that the data acquisition module is further used for:
collecting the historical log data and monitoring indicator data, the various fault types and their solutions, and sending them to the model optimization and self-learning module.

5. The Elasticsearch cluster fault intelligent detection system according to claim 4, characterized in that the model optimization and self-learning module is further used for:
labeling the historical log data and monitoring indicator data with the corresponding fault type according to the Elasticsearch cluster fault symptoms to which they correspond, and recording, for each item of historical log data and monitoring indicator data labeled as a fault, the occurrence time, triggering conditions, operation steps, and the degree of impact of the fault on the Elasticsearch cluster and on the business, to generate the training data set.

6. The Elasticsearch cluster fault intelligent detection system according to claim 4, characterized in that the model optimization and self-learning module is further used for:
periodically retraining and updating the fault detection model according to newly added historical log data and monitoring indicator data.

7. The Elasticsearch cluster fault intelligent detection system according to claim 1, characterized in that the model optimization and self-learning module is further used for:
retraining the fault detection model according to updated rules and newly added rules in the expert rule base.

8. The Elasticsearch cluster fault intelligent detection system according to claim 1, characterized in that the model optimization and self-learning module is further used for:
using a cross-entropy loss function to measure the difference between the predicted fault type and the target fault type, and updating the parameters of the fault detection model accordingly.

9. The Elasticsearch cluster fault intelligent detection system according to claim 1, characterized in that the fault type includes at least one of the following:
index fragmentation, query latency, central processing unit (CPU) overload, memory overflow, master node failure, data node failure, and network partition failure.

10. The Elasticsearch cluster fault intelligent detection system according to any one of claims 4 to 6, characterized in that the system further comprises:
a system monitoring and feedback module, connected to the data acquisition module, the Elasticsearch cluster fault analysis module, the expert system module and the model optimization and self-learning module, and used to monitor the operation of each of those modules.
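By way of non-limiting illustration, the interplay of the Softmax output (claim 3), the expert-rule verification (claim 2) and the cross-entropy loss (claim 8) can be sketched in Python as follows. This is a minimal sketch and not the applicant's implementation: the `Rule` structure, the symptom labels, the example logits and the subset-based matching criterion are all assumptions introduced for illustration, and the convolution, pooling and fully connected layers are replaced by a hard-coded score vector.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical expert rule: a fault type and the symptoms that indicate it."""
    fault_type: str
    symptoms: frozenset


def softmax(logits):
    """Turn raw scores into a probability distribution over fault types."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verify(predicted, observed, rules):
    """Return the target fault type following the two branches of claim 2."""
    # Branch 1: a rule exists whose fault type and symptoms both match.
    for rule in rules:
        if rule.fault_type == predicted and rule.symptoms <= observed:
            return predicted
    # Branch 2: no such rule, so fall back to the fault type associated
    # with the observed symptoms in the rule base.
    for rule in rules:
        if rule.symptoms <= observed:
            return rule.fault_type
    # The claims do not cover the case where no rule matches at all;
    # returning None here is an assumption.
    return None


def cross_entropy(probs, target_index):
    """Loss between the predicted distribution and the target fault type."""
    return -math.log(probs[target_index])


# Illustrative values only; none of these come from the source.
FAULT_TYPES = ["cpu_overload", "memory_overflow", "master_node_failure"]
RULES = [
    Rule("cpu_overload", frozenset({"cpu_above_90"})),
    Rule("memory_overflow", frozenset({"heap_above_95", "gc_pauses"})),
]

logits = [0.2, 1.5, 0.1]  # stand-in for the fully connected layer output
probs = softmax(logits)
predicted = FAULT_TYPES[probs.index(max(probs))]

observed = frozenset({"cpu_above_90"})  # symptoms derived from logs and metrics
target = verify(predicted, observed, RULES)
loss = cross_entropy(probs, FAULT_TYPES.index(target))
```

In this example the model predicts `memory_overflow`, but no rule with that fault type matches the observed symptoms, so the second branch selects `cpu_overload` as the target fault type. The loss is then computed against that target, which is the signal the model optimization and self-learning module would use to update the model parameters.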
CN202411575774.7A 2024-11-06 2024-11-06 Elasticsearch cluster fault intelligent detection system Pending CN119248617A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411575774.7A CN119248617A (en) 2024-11-06 2024-11-06 Elasticsearch cluster fault intelligent detection system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411575774.7A CN119248617A (en) 2024-11-06 2024-11-06 Elasticsearch cluster fault intelligent detection system

Publications (1)

Publication Number Publication Date
CN119248617A true CN119248617A (en) 2025-01-03

Family

ID=94022537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411575774.7A Pending CN119248617A (en) 2024-11-06 2024-11-06 Elasticsearch cluster fault intelligent detection system

Country Status (1)

Country Link
CN (1) CN119248617A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119415221A (en) * 2025-01-06 2025-02-11 北京壁仞科技开发有限公司 Active testing methods, devices, equipment, media and products for AI computing clusters

Similar Documents

Publication Publication Date Title
CN111210024B (en) Model training method, device, computer equipment and storage medium
CN112953629B (en) Method and system for analyzing uncertainty of optical network fault prediction
CN112966714B (en) Edge time sequence data anomaly detection and network programmable control method
CN115412947B (en) Fault simulation method and system based on digital twin and AI algorithm
CN117743909A (en) Heating system fault analysis method and device based on artificial intelligence
CN115508672B (en) Fault traceability reasoning method, system, equipment and medium of power grid main equipment
CN119248617A (en) Elasticsearch cluster fault intelligent detection system
CN117527622B (en) Data processing method and system of network switch
CN119066541A (en) New energy station monitoring data quality evaluation method and system based on multi-source data
CN115269314A (en) Transaction abnormity detection method based on log
CN119728397B (en) Network fault prediction method and system
CN117670239A (en) Smart campus data monitoring application system based on the Internet of Things
CN119902905A (en) Server load status evaluation method based on dynamic evaluation algorithm
CN118965247B (en) Power plant data management method and system based on multi-source data
CN117135038A (en) Network fault monitoring methods, devices and electronic equipment
JP7062505B2 (en) Equipment management support system
CN119557134A (en) Fault handling method, device, electronic device and storage medium for cloud computing platform
KR102572192B1 (en) Auto Encoder Ensemble Based Anomaly Detection Method and System
CN119577681A (en) Intelligent management method of industrial equipment data based on big data algorithm
CN117520040A (en) Micro-service fault root cause determining method, electronic equipment and storage medium
CN119761781B (en) Health management method of thermal power generation equipment based on knowledge graph and mechanism model
CN109961171A (en) A capacitor fault prediction method based on machine learning and big data analysis
CN119691576A (en) Self-adaptive operation and maintenance root cause positioning method and system based on deep learning
CN118863859A (en) Power grid fault alarm method, device, storage medium and system
Wang et al. Mining Method of SER Fault Event Set in Converter Station Based on Improved Association Rule Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination