CN107391335B - Method and equipment for checking health state of cluster
Info
- Publication number
- CN107391335B (application CN201710205541.1A)
- Authority
- CN
- China
- Prior art keywords
- check
- cluster
- updated
- rule
- monitoring data
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0677—Localisation of faults
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/14—Network analysis or design
- H04L41/147—Network analysis or design for predicting network behaviour
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Computing Systems (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Mathematical Physics (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The purpose of the application is to provide a method and equipment for checking the health status of a cluster. The method acquires the related information of the cluster to be checked; acquires at least one problem to be checked and the check rule corresponding to that problem; acquires, from the cluster and based on its related information, the monitoring data of the checkpoints referenced by the check rule, and aggregates the monitoring data to obtain a processing result; and retrieves the problem corresponding to the processing result and generates and feeds back health early-warning information based on the related information of the problem. In this way, the health of the multiple checkpoints corresponding to a problem is monitored as the problem develops, the accuracy of predicting the health of the checkpoints corresponding to problems in the cluster is improved, the real-time performance of multi-checkpoint monitoring of the online distributed file system is improved, and multiple checkpoints can raise alarms in advance.
Description
Technical Field
The present application relates to the field of computers, and more particularly, to a technique for checking health status of a cluster.
Background
In distributed cluster alarm systems, the scale of the distributed file system keeps growing along with the explosive growth of data from user equipment. As the clusters of the distributed file system age and the workload keeps increasing, all kinds of problems emerge, and single-point problems on a single server within a cluster node can easily accumulate into a major fault. If the alarm system only raises an alarm after a problem has suddenly occurred, waking maintenance personnel to investigate and apply a remedy, the best time to solve the problem may already have been missed and a fault may result.
In the prior art, the distributed cluster alarm system raises single-point alarms on the hardware (for example the memory or hard disk) or local software modules and the operating system of each individual service device under every cluster node: it alarms when a single point fails, and after the service devices collect simple abnormality alarm information, a large volume of alarms is forwarded to maintenance personnel in one batch. Because such a system only alarms once a single point already has a problem, setting the alarm threshold too loosely can allow faults to develop before any alarm is raised, while setting it too tightly produces a large number of false alarms. In addition, the prior-art system mainly alarms on single points of the service device hardware and operating system and does not judge the availability, performance or service quality of the distributed file system, so the whole distributed file system is alarmed point by point and the alarm accuracy is low. Furthermore, simply collecting a large volume of abnormality alarm information and forwarding it to maintenance personnel for investigation and resolution makes the alarms both inaccurate and untimely.
Therefore, the prior-art approach of using a distributed cluster alarm system to raise single-point alarms on problems of the hardware and operating system of a single service device under each cluster node in a distributed file system results in low alarm accuracy and poor real-time performance.
Disclosure of Invention
The aim of the application is to provide a method and equipment for checking the health status of a cluster, so as to solve the prior-art problem that single-point alarms on the hardware and operating system of a single service device under each cluster node in a distributed file system yield low alarm accuracy and poor real-time performance.
According to an aspect of the present application, there is provided a method for checking health status of a cluster, comprising:
acquiring related information of a cluster to be checked;
acquiring at least one problem to be checked and a check rule corresponding to the problem;
acquiring monitoring data of a check point related to the check rule from the cluster based on the related information of the cluster, and performing aggregation processing on the monitoring data to obtain a processing result;
and calling the corresponding problem based on the processing result, and generating and feeding back health early warning information based on the relevant information of the problem.
Further, aggregating the monitoring data to obtain a processing result includes:
and respectively processing the monitoring data of each check point based on a check rule corresponding to the problem to be checked so as to obtain at least one check point with abnormal monitoring data and feed back a processing result.
According to an aspect of the present application, there is provided a method for checking health status of a cluster, further comprising:
creating a problem rule base, wherein the problem rule base comprises at least one problem and a corresponding check rule;
and updating the problems in the problem rule base and the corresponding check rules.
Further, updating the problems in the problem rule base and their corresponding check rules includes:
acquiring relevant information of a cluster to be checked, a problem to be updated and an initial monitoring threshold value of the problem to be updated;
based on the initial monitoring threshold, acquiring the occurrence time point of the problem to be updated and the monitoring data of all the check points in a set time period before the occurrence time point from the related information of the cluster, and determining and recording the abnormal check points based on the monitoring data;
when the problem to be updated occurs in each set time period, updating the occurrence probability of each check point when the problem to be updated occurs based on the check point of the abnormality recorded in the current set time period and the check point of the abnormality recorded in history;
updating the check rule of the problem to be updated based on the updated check point with the occurrence probability higher than the set probability and the relevant information thereof.
According to another aspect of the present application, there is also provided an apparatus for checking health status of a cluster, including:
the information acquisition device is used for acquiring the related information of the cluster to be checked;
the rule acquisition device is used for acquiring at least one problem to be checked and a corresponding check rule;
the monitoring processing device is used for acquiring monitoring data of a check point related to the check rule from the cluster based on the related information of the cluster, and performing aggregation processing on the monitoring data to obtain a processing result;
and the early warning feedback device is used for calling the corresponding problem based on the processing result and generating and feeding back health early warning information based on the relevant information of the problem.
Further, the monitoring processing device includes:
and the data processing unit is used for respectively processing the monitoring data of each check point based on the check rule corresponding to the problem to be checked so as to obtain at least one check point with monitoring data abnormity and feed back a processing result.
According to an aspect of the present application, there is provided an apparatus for checking health status of a cluster, further comprising:
the system comprises a creating rule device, a checking rule device and a processing device, wherein the creating rule device is used for creating a question rule base, and the question rule base comprises at least one question and a corresponding checking rule;
and the rule updating device is used for updating the problems in the problem rule base and the corresponding check rules.
Further, the rule updating device includes:
the first information acquisition unit is used for acquiring relevant information of a cluster to be checked, a problem to be updated and an initial monitoring threshold value of the problem to be updated;
a first recording unit, configured to obtain, based on the initial monitoring threshold, the occurrence time point of the problem to be updated and monitoring data of all the check points in a set time period before the occurrence time point from related information of the cluster, and determine and record the abnormal check point based on the monitoring data;
a first probability updating unit configured to update, when the problem to be updated occurs within each of the set time periods, an occurrence probability of each of the checkpoints at the time of occurrence of the problem to be updated based on the checkpoints of the anomalies recorded within the current set time period and the checkpoints of the anomalies recorded in history;
and the first rule updating unit is used for updating the check rule of the problem to be updated based on the updated check point with the occurrence probability higher than the set probability and the relevant information thereof.
In addition, the present application also provides an apparatus for checking health status of a cluster, including:
a processor;
and a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring related information of a cluster to be checked;
acquiring at least one problem to be checked and a check rule corresponding to the problem;
acquiring monitoring data of a check point related to the check rule from the cluster based on the related information of the cluster, and performing aggregation processing on the monitoring data to obtain a processing result;
and calling the corresponding problem based on the processing result, and generating and feeding back health early warning information based on the relevant information of the problem.
Compared with the prior art, the method and equipment for checking the health status of a cluster provided by the embodiments of the present application acquire the related information of the cluster to be checked; acquire at least one problem to be checked and the check rule corresponding to that problem; acquire, from the cluster and based on its related information, the monitoring data of the checkpoints referenced by the check rule, and aggregate the monitoring data to obtain a processing result; and retrieve the corresponding problem based on the processing result and generate and feed back health early-warning information based on the related information of the problem. Before the health of the online distributed file system is predicted, as many as possible of the problems that could become abnormal online are turned into rules, yielding a check rule for each problem to be checked. When the health of the online distributed file system is then predicted, the monitoring data of each checkpoint can be obtained directly and aggregated using the check rule to obtain a processing result, which improves the accuracy of health monitoring over the multiple checkpoints under each cluster node. The corresponding problem is retrieved based on the processing result, and health early-warning information is generated and fed back based on the related information of the problem, so that maintenance personnel can, based on the fed-back health early-warning information, warn about each problematic checkpoint under each cluster node in advance and handle the related health early-warning information. This improves the real-time performance of multi-checkpoint monitoring of the online distributed file system and achieves advance multi-point alarming. Further, aggregating the monitoring data to obtain a processing result includes: processing the monitoring data of each checkpoint separately, based on the check rule corresponding to the problem to be checked, to obtain at least one checkpoint whose monitoring data is abnormal and to feed back a processing result, thereby monitoring the health of the multiple checkpoints corresponding to a problem and improving the accuracy of predicting the health of each checkpoint corresponding to the problem in the cluster.
Further, the method and equipment for checking the health status of a cluster provided by the embodiments of the present application also create a problem rule base, where the problem rule base includes at least one problem and the check rule corresponding to each problem, and update the problems in the problem rule base and their corresponding check rules. This ensures that check rules are created for as many checkpoints as possible that could develop problems in the online distributed file system. Because the problems in the problem rule base and their check rules are updated based on the monitoring data of all the checkpoints, the created problem rule base reflects the abnormal checkpoints in the distributed file system more comprehensively and more accurately, the health of the multiple checkpoints corresponding to a problem is monitored when that problem occurs, and the accuracy and real-time performance of predicting the health of each checkpoint corresponding to a problem in the cluster are improved.
Further, updating the problems in the problem rule base and their corresponding check rules includes: acquiring the related information of the cluster to be checked, the problem to be updated, and the initial monitoring threshold of the problem to be updated; based on the initial monitoring threshold, acquiring from the related information of the cluster the occurrence time point of the problem to be updated and the monitoring data of all the checkpoints within a set time period before that occurrence time point, and determining and recording the abnormal checkpoints based on the monitoring data; each time the problem to be updated occurs within a set time period, updating the occurrence probability of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded within the current set time period and the abnormal checkpoints recorded in history; and updating the check rule of the problem to be updated based on the checkpoints whose updated occurrence probability is higher than the set probability and their related information. The problem rule base is thus updated by updating the check rule of the problem to be updated, so that it reflects the abnormal checkpoints in the distributed file system more comprehensively and more accurately, the health of the multiple checkpoints corresponding to a problem is monitored when that problem occurs, and the accuracy and real-time performance of predicting the health of each checkpoint corresponding to a problem in the cluster are improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a flow diagram of a method for checking health status of a cluster, in accordance with an aspect of the subject application;
FIG. 2 illustrates a flow diagram of a corresponding method for creating a problem rule base in a method for checking cluster health status according to yet another aspect of the subject application;
Fig. 3 is a schematic flowchart of step S16, updating the problem rule base, in a method for checking the health status of a cluster according to an embodiment of the present application;
Fig. 4 is a schematic flowchart of step S16, updating the problem rule base, in a method for checking the health status of a cluster according to yet another embodiment of the present application;
FIG. 5 illustrates an apparatus schematic for checking cluster health status according to an aspect of the subject application;
FIG. 6 illustrates a schematic diagram of the apparatus for creating the problem rule base in an apparatus for checking cluster health status according to yet another aspect of the present application;
fig. 7 is a schematic structural diagram illustrating a rule updating apparatus 16 in a device for checking health status of a cluster according to an embodiment of the present application;
fig. 8 is a schematic structural diagram illustrating a rule updating apparatus 16 in a device for checking health status of a cluster according to yet another embodiment of the present application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
FIG. 1 illustrates a flow diagram of a method for checking cluster health, according to an aspect of the subject application. The method includes step S11, step S12, step S13, and step S14.
Wherein the step S11: acquiring related information of a cluster to be checked; the step S12: acquiring at least one problem to be checked and a check rule corresponding to the problem; the step S13: acquiring monitoring data of a check point related to the check rule from the cluster based on the related information of the cluster, and performing aggregation processing on the monitoring data to obtain a processing result; the step S14: and calling the corresponding problem based on the processing result, and generating and feeding back health early warning information based on the relevant information of the problem.
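Read as a pipeline, steps S11 to S14 amount to: fetch the cluster's related information, look up each problem and its check rule, pull and aggregate the checkpoint monitoring data, and emit a warning for any problem whose rule matches. The following is a minimal sketch of that flow; the function and field names (check_cluster_health, fetch_monitoring_data, rule_base) are assumptions for illustration and are not taken from the patent.

```python
# A minimal sketch of steps S11-S14, assuming a dict-based rule base and a
# fetch_monitoring_data() helper backed by the cluster's monitoring module.
from typing import Callable, Dict, List


def check_cluster_health(
    cluster_info: Dict,                                   # S11: location + check period
    rule_base: Dict[str, Dict[str, float]],               # S12: problem -> {checkpoint: anomaly threshold}
    fetch_monitoring_data: Callable[[Dict, str], float],  # monitoring-module accessor (assumed)
) -> List[Dict]:
    warnings = []
    for problem, rule in rule_base.items():
        # S13: pull monitoring data for every checkpoint named by the rule and
        # aggregate it into a set of abnormal checkpoints.
        abnormal = {}
        for checkpoint, threshold in rule.items():
            value = fetch_monitoring_data(cluster_info, checkpoint)
            if value > threshold:
                abnormal[checkpoint] = value
        # S14: if any checkpoint is abnormal, retrieve the problem and emit
        # health early-warning information for it.
        if abnormal:
            warnings.append({"problem": problem, "abnormal_checkpoints": abnormal})
    return warnings
```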
In an embodiment of the present application, the cluster to be checked in step S11 is located on one or more cluster nodes in a distributed file system, where the distributed file system means that the physical storage resources managed by the file system are not necessarily directly connected to the local node, but are connected to the nodes through a computer network. The following describes the present application in detail with reference to a distributed file system as an example. Of course, the detailed explanation of the specific embodiment of the present application is made by taking a distributed file system as an example, and the embodiment of the present application is only for illustrative purposes, and is not limited thereto, and the following embodiments may be implemented in other distributed cluster systems as well.
Further, the checkpoint comprises at least any one of: hardware devices in the cluster, and local modules of software devices in the cluster.
It should be noted that the checkpoints in step S13 include, but are not limited to, the hardware devices of a single server under each cluster node in the distributed file system and the local modules of software devices in the distributed file system. The hardware devices of the server include, but are not limited to, a central processing unit, a memory, a hard disk, a chipset, an input/output bus, input/output devices, a power supply, a chassis, and the like; the local modules of the software device include, but are not limited to, a system setup program module, a fault diagnosis program module, a fault handling program module, and the like. Of course, other existing or future checkpoints, if applicable to the present application, are also intended to be encompassed within the scope of the present application and are hereby incorporated by reference.
Further, the step S11 includes: acquiring related information of a cluster to be checked; specifically, the step S11 includes: based on a request submitted by a user, obtaining relevant information of a cluster to be checked, wherein the relevant information comprises: cluster location information and a check period.
In the embodiment of the application, when the health condition of the online distributed file system needs to be monitored, a request submitted by a user is obtained, the cluster position information of a cluster to be checked and the checking time period for monitoring the cluster to be checked are obtained based on the request submitted by the user, wherein the cluster position information and the checking time period both belong to the related information of the cluster to be checked.
For example, when the health of the online distributed file system needs to be monitored, a request submitted by a user for monitoring the checkpoints in the cluster is obtained. Based on that request, the cluster position information of the cluster to be checked is obtained, together with one or more check periods for which monitoring data of the checkpoints is to be acquired. The cluster position information may be the actual geographical range in which cluster nodes distributed across different areas are located, or the actual geographical range in which cluster nodes within the same area are located.
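A request of this kind only needs to carry the cluster position information and the check period; the sketch below shows one possible container for it, with hypothetical class and field names.

```python
# Hypothetical container for the "related information of the cluster to be
# checked": where the cluster is and which time window to inspect.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ClusterCheckRequest:
    cluster_location: str      # e.g. "Shanghai": a region or node range
    check_start: datetime      # beginning of the check period
    check_end: datetime        # end of the check period


request = ClusterCheckRequest(
    cluster_location="Shanghai",
    check_start=datetime(2017, 3, 1, 0, 0),
    check_end=datetime(2017, 3, 1, 12, 0),
)
```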
Further, the step S12 includes: acquiring at least one problem to be checked and a check rule corresponding to the problem; specifically, the step S12 includes: and acquiring at least one problem to be checked and a corresponding check rule from the problem rule base.
It should be noted that the problem rule base in step S12 mainly contains problems that have already been established and their corresponding check rules. The problems include, but are not limited to, memory leaks, read-write long tails, data loss, system performance problems, system availability problems and quality-of-service problems; a check rule consists of checkpoints and the anomaly thresholds of their monitoring data. Of course, other existing or future problem rule bases, if applicable to this application, are intended to be encompassed within the scope of this application and are hereby incorporated by reference.
For example, if the problem in the problem rule base is a memory leak, the corresponding check rule includes: the checkpoint "change rate of the service pressure over the last week" and its anomaly threshold, the checkpoint "total amount of created files" and its anomaly threshold, and the checkpoint "memory usage growth slope" and its anomaly threshold. If the problem in the problem rule base is a read-write long tail, the corresponding check rule includes: the checkpoint "read-write call frequency over the last week" and its anomaly threshold, the checkpoint "retransmission rate of the network within the cluster" and its anomaly threshold, and the checkpoint "disk health score within the cluster" and its anomaly threshold.
Further, the step S13 of obtaining the monitoring data of the checkpoint related to the check rule from the cluster includes: searching the cluster based on the cluster position information, and acquiring a check point related to the check rule in the cluster; and acquiring monitoring data related to the check point in the check time period from a monitoring module of the cluster.
It should be noted that the monitoring module of the cluster is mainly responsible for acquiring monitoring data of each hardware device and each checkpoint related to the software device from the monitoring system in the cluster. Of course, other existing or future monitoring modules of the cluster, as may be applicable to the present application, are also intended to be included within the scope of the present application and are hereby incorporated by reference.
In the above embodiment, in step S13, if the cluster position information is Shanghai geographic position information, the Shanghai cluster is located based on that information, and the checkpoints referenced by the check rule are obtained from the Shanghai cluster. The monitoring data of each relevant checkpoint within the check period is then acquired from the monitoring module of the Shanghai cluster, for example: a total amount of created files of 34, a memory-usage growth slope of 48%, a change rate of the service pressure over the last week of 1%, a read-write call frequency over the last week of 75.6%, an in-cluster network retransmission rate of 5.3%, and an in-cluster disk-health score of 15.
Further, the aggregating the monitoring data to obtain a processing result in step S13 includes: and respectively processing the monitoring data of each check point based on a check rule corresponding to the problem to be checked so as to obtain at least one check point with abnormal monitoring data and feed back a processing result.
In the above embodiment, in step S13, whether a problem to be checked exists can be determined by comparing the monitoring data of the checkpoints against the check rule corresponding to that problem. To predict whether the online distributed file system has a memory-leak problem, the monitoring data of three checkpoints (the change rate of the service pressure over the last week, the total amount of created files, and the memory-usage growth slope) are matched against the corresponding check rule to obtain a processing result for the prediction. To predict whether it has a read-write long-tail problem, the monitoring data of three checkpoints (the read-write call frequency over the last week, the in-cluster network retransmission rate, and the in-cluster disk-health score) are matched against the corresponding check rule to obtain a processing result.
For example, in the check rule for the memory-leak problem, suppose the anomaly threshold of the checkpoint "total amount of created files" is 30; since the monitored value 34 exceeds 30, this checkpoint is abnormal. Suppose the anomaly threshold of the checkpoint "memory-usage growth slope" is 20%; since the monitored value 48% exceeds 20%, this checkpoint is abnormal. Suppose the anomaly threshold of the checkpoint "change rate of the service pressure over the last week" is 14%; since the monitored value 1% is below 14%, this checkpoint is normal. In the check rule for the read-write long-tail problem, suppose the anomaly threshold of the checkpoint "read-write call frequency over the last week" is 30%; since the monitored value 75.6% exceeds 30%, this checkpoint is abnormal. Suppose the anomaly threshold of the checkpoint "in-cluster network retransmission rate" is 10%; since the monitored value 5.3% is below 10%, this checkpoint is normal. Suppose the anomaly threshold of the checkpoint "in-cluster disk-health score" is 60; since the monitored value 15 is below 60, this checkpoint is normal. The resulting processing result is therefore: in the check rule for the memory-leak problem, the checkpoints "total amount of created files" and "memory-usage growth slope" are abnormal; in the check rule for the read-write long-tail problem, the checkpoint "read-write call frequency over the last week" is abnormal.
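Using the monitored values and thresholds of this example, the aggregation in step S13 reduces to a per-checkpoint threshold comparison. The sketch below reproduces the example under the assumption that a checkpoint is abnormal exactly when its monitored value exceeds its anomaly threshold; the identifier names are illustrative.

```python
# Per-checkpoint threshold comparison for the worked example above.
# "Abnormal" is taken to mean the monitored value exceeds its threshold.
RULES = {
    "memory_leak": {
        "total_created_files": 30.0,
        "memory_usage_growth_slope": 20.0,               # percent
        "service_pressure_change_rate_last_week": 14.0,  # percent
    },
    "read_write_long_tail": {
        "read_write_call_frequency_last_week": 30.0,     # percent
        "network_retransmission_rate": 10.0,             # percent
        "disk_health_score": 60.0,
    },
}

MONITORING = {
    "total_created_files": 34.0,
    "memory_usage_growth_slope": 48.0,
    "service_pressure_change_rate_last_week": 1.0,
    "read_write_call_frequency_last_week": 75.6,
    "network_retransmission_rate": 5.3,
    "disk_health_score": 15.0,
}


def aggregate(rules, monitoring):
    """Return {problem: {checkpoint: value}} for every abnormal checkpoint."""
    result = {}
    for problem, rule in rules.items():
        abnormal = {cp: monitoring[cp] for cp, limit in rule.items() if monitoring[cp] > limit}
        if abnormal:
            result[problem] = abnormal
    return result


print(aggregate(RULES, MONITORING))
# {'memory_leak': {'total_created_files': 34.0, 'memory_usage_growth_slope': 48.0},
#  'read_write_long_tail': {'read_write_call_frequency_last_week': 75.6}}
```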
In the above embodiment, after the monitoring data of each checkpoint has been processed in step S13 against the check rule of the problem to be checked, a corresponding processing result is obtained. Then, in step S14, the problems are retrieved based on the processing result: because the checkpoints "total amount of created files" and "memory-usage growth slope" are abnormal in the check rule for the memory-leak problem, and the checkpoint "read-write call frequency over the last week" is abnormal in the check rule for the read-write long-tail problem, the retrieved problems are memory leak and read-write long tail. In step S14, health early-warning information is then generated and fed back based on the related information of these problems.
Further, the relevant information of the question includes at least any one of: the occurrence time of the problem, the monitoring data of each relevant check point, and the check point when the monitoring data is abnormal when the problem occurs.
Next, in the above embodiment, the health warning information is generated based on the relevant information of the problem, and the health warning information includes the problem and the corresponding occurrence time thereof, and each of the check points and the monitoring data thereof where the monitoring data is abnormal when the problem occurs.
For example, based on the processing result that the checkpoints "total amount of created files" and "memory-usage growth slope" are abnormal in the check rule for the memory-leak problem and the checkpoint "read-write call frequency over the last week" is abnormal in the check rule for the read-write long-tail problem, the retrieved problems are memory leak and read-write long tail. Health early-warning information is then generated from the related information of these problems according to an early-warning report template in the distributed file system, for example: { memory leak: total amount of created files is 34 and abnormal at t1, memory-usage growth slope is 48% and abnormal at t2 }; { read-write long tail: read-write call frequency over the last week is 75.6% and abnormal at t3 }. This information is fed back to the system maintenance personnel, who can then warn about, and handle, each problematic checkpoint under the cluster in advance based on the fed-back health early-warning information. This improves the real-time performance of multi-checkpoint monitoring of the online distributed file system, achieves advance warning and handling for the checkpoints in the cluster, and also improves the accuracy of predicting the health of each checkpoint corresponding to a problem in the cluster.
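Producing the early-warning text from the processing result is a formatting step over the problem, its occurrence times and the abnormal checkpoints. The sketch below is one possible rendering; the report layout, timestamps and field names are assumptions rather than the patent's early-warning report template.

```python
# Hypothetical formatting of health early-warning information from a processing
# result: problem name plus the abnormal checkpoints, their monitored values and
# the times at which they became abnormal (t1, t2 in the example are placeholders).
from datetime import datetime


def build_warning(problem: str, abnormal: dict) -> str:
    lines = [f"{problem}:"]
    for checkpoint, (value, when) in abnormal.items():
        lines.append(f"  {when:%Y-%m-%d %H:%M}  {checkpoint} = {value} (abnormal)")
    return "\n".join(lines)


report = build_warning(
    "memory leak",
    {
        "total created files": (34, datetime(2017, 3, 1, 10, 5)),           # t1 (assumed)
        "memory usage growth slope": ("48%", datetime(2017, 3, 1, 10, 7)),  # t2 (assumed)
    },
)
print(report)
```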
In the step S14, if the monitoring data of each checkpoint in all the processing results does not exceed the abnormal threshold, health state information is generated, so that the distributed file system maintenance staff can know that the entire distributed file system is in a healthy state without performing health early warning processing.
In the embodiment of the present application, in order to perform aggregation calculation on the monitoring data of the checkpoints of each cluster in the distributed file system using the problem rule base, the problem rule base needs to be created and continuously updated, as shown in fig. 2.
Fig. 2 is a flow chart illustrating a method for creating a question rule base according to another aspect of the present application. The method includes step S15 and step S16.
Wherein the step S15 includes: creating a problem rule base, wherein the problem rule base comprises at least one problem and a corresponding check rule; the step S16 includes: and updating the problems in the problem rule base and the corresponding check rules.
In an embodiment of the present application, before performing aggregation calculation on monitoring data of each checkpoint in the distributed file system, the problem rule base needs to be created, where the problem rule base includes at least one problem and a check rule corresponding to each problem, and the check rule includes at least one checkpoint and an exception threshold of each checkpoint, that is, when a problem occurs, multiple checkpoints are abnormal, and the problem corresponding to the abnormal checkpoint is predicted based on the multiple checkpoints where the exception occurs.
For example, the problem rule base includes problem 1, problem 2 and problem 3, where the check rule corresponding to problem 1 is { problem 1: the anomaly threshold of checkpoint A is a1, the anomaly threshold of checkpoint B is b1, and the anomaly threshold of checkpoint C is c1 }; the check rule corresponding to problem 2 is { problem 2: the anomaly threshold of checkpoint D is d1, the anomaly threshold of checkpoint E is e1, and the anomaly threshold of checkpoint F is f1 }; and the check rule corresponding to problem 3 is { problem 3: the anomaly threshold of checkpoint G is g1, and the anomaly threshold of checkpoint H is h1 }.
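Creating the problem rule base in step S15 amounts to registering, for each problem, its checkpoints and their anomaly thresholds. A minimal sketch with hypothetical names, mirroring the example above:

```python
# Minimal sketch of problem-rule-base creation (step S15): each problem is
# registered with a mapping of checkpoint -> anomaly threshold. The symbolic
# thresholds a1, b1, ... mirror the example above.
class ProblemRuleBase:
    def __init__(self):
        self._rules = {}

    def create(self, problem: str, thresholds: dict) -> None:
        """Register (or replace) the check rule of a problem."""
        self._rules[problem] = dict(thresholds)

    def rule_for(self, problem: str) -> dict:
        return self._rules[problem]


base = ProblemRuleBase()
base.create("problem 1", {"A": "a1", "B": "b1", "C": "c1"})
base.create("problem 2", {"D": "d1", "E": "e1", "F": "f1"})
base.create("problem 3", {"G": "g1", "H": "h1"})
print(base.rule_for("problem 1"))   # {'A': 'a1', 'B': 'b1', 'C': 'c1'}
```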
As user data grows explosively, the scale of the distributed file system keeps increasing. When predicting the health of such an ever-growing online distributed file system, several checkpoints typically become abnormal in advance of an actual problem, so iterative computation over the abnormal monitoring data of each checkpoint within a fixed time period before the problem occurs is needed to find the check rule that best reflects the abnormality of that problem, as shown in fig. 3.
Fig. 3 is a schematic flowchart of step S16, updating the problem rule base, in a method for checking the health status of a cluster according to an embodiment of the present application. The method comprises the following steps: step S161, step S162, step S163, and step S164.
Wherein, the step S161 obtains the relevant information of the cluster to be checked, the problem to be updated and the initial monitoring threshold thereof; the step S162 obtains, from the related information of the cluster, the occurrence time point of the problem to be updated and the monitoring data of all the check points in a set time period before the occurrence time point based on the initial monitoring threshold, and determines and records the abnormal check point based on the monitoring data; the step S163, when the problem to be updated occurs within each of the set time periods, updates the probability of occurrence of each of the checkpoints at the time of occurrence of the problem to be updated, based on the checkpoint of the abnormality recorded in the current set time period and the checkpoint of the abnormality recorded in history; the step S164 updates the check rule of the problem to be updated based on the updated check point and the related information thereof, the probability of occurrence of which is higher than the set probability.
In the embodiment of the present application, when a problem in the problem rule base needs to be updated, first, step S161 obtains the cluster position information and the inspection time period of the cluster to be inspected, the problem to be updated to be trained, and an initial monitoring threshold corresponding to the problem to be updated; then, the step S162 obtains, from the cluster corresponding to the cluster position information based on the initial monitoring threshold, the occurrence time point of the problem to be updated in the inspection time period and the monitoring data of all the inspection points in a set time period before the occurrence time point, and records the inspection point where the monitoring data is abnormal; then, the step S163 updates, when the problem to be updated occurs within each of the set time periods, the probability of occurrence of each of the checkpoints at the time of occurrence of the problem to be updated, based on the checkpoints of the abnormalities recorded within the current set time period and the checkpoints of the abnormalities recorded in history; finally, in step S164, based on the check point with the updated occurrence probability higher than the set probability and the relevant information thereof, the check rule of the problem to be updated is updated, so that the problem rule base is updated by updating the check rule of the problem to be updated, so that the problem rule base can more comprehensively and accurately reflect the abnormal check point in the distributed file system, the health status of the multiple check points corresponding to the problem occurs is monitored, and the accuracy and the real-time performance of the health status pre-judgment of each check point corresponding to the problem in the cluster are improved.
Further, the initial monitoring threshold includes: an anomaly threshold for monitoring data for all of the checkpoints and a weight threshold for the checkpoint at which an anomaly occurred; the step 162 includes: based on the initial monitoring threshold, acquiring the occurrence time point of the problem to be updated and the monitoring data of all the check points in a set time period before the occurrence time point from the related information of the cluster, and determining and recording the abnormal check points based on the monitoring data; specifically, the step 162 includes: based on the abnormal threshold of the monitoring data of all the check points, acquiring the occurrence time point of the problem to be updated and the monitoring data of all the check points in a set time period before the occurrence time point from the related information of the cluster, and recording the corresponding check point when the weight of the abnormal check point exceeds the weight threshold, wherein the weight of the check point is determined based on the occurrence probability of the abnormal check point.
It should be noted that, when the problem to be updated occurs, the occurrence probability of a checkpoint and the weight of a checkpoint are calculated as follows. If problem 1 to be updated occurred 1000 times within the set time period, and checkpoint A was abnormal 654 times, checkpoint B 252 times and checkpoint C 94 times, then the occurrence probability of checkpoint A is 65.4%, that of checkpoint B is 25.2% and that of checkpoint C is 9.4%. The weight of checkpoint A is 65.4% / (65.4% + 25.2% + 9.4%) = 65.4%, the weight of checkpoint B is 25.2% / (65.4% + 25.2% + 9.4%) = 25.2%, and the weight of checkpoint C is 9.4% / (65.4% + 25.2% + 9.4%) = 9.4%. Of course, other existing or future methods for calculating the occurrence probability and weight of a checkpoint, if applicable to the present application, are also intended to be included within the scope of the present application and are hereby incorporated by reference.
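The probability and weight arithmetic described here is plain counting and normalisation; the sketch below reproduces the numbers of the example (identifier names are illustrative).

```python
# Occurrence probability = times a checkpoint was abnormal / times the problem
# occurred; weight = probability / sum of probabilities over all checkpoints.
occurrences = {"A": 654, "B": 252, "C": 94}   # problem 1 occurred 1000 times
problem_count = 1000

probability = {cp: n / problem_count for cp, n in occurrences.items()}
prob_sum = sum(probability.values())
weight = {cp: p / prob_sum for cp, p in probability.items()}

print(probability)  # {'A': 0.654, 'B': 0.252, 'C': 0.094}
print(weight)       # ~0.654, 0.252, 0.094: equal to the probabilities, which already sum to 1
```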
Preferably, the determining and recording the check point of the abnormality based on the monitoring data in the step S162 includes: judging whether the monitoring data of the check point exceeds an abnormal threshold value; and if the detected data exceeds the preset threshold, determining and recording the corresponding abnormal check point.
For example, suppose the check rule for problem 1 to be updated in the problem rule base is { problem 1: the anomaly threshold of checkpoint A is a1, the anomaly threshold of checkpoint B is b1, the anomaly threshold of checkpoint C is c1 }, and the weight threshold of a checkpoint is 10%. Based on the cluster position information and the check period to be inspected, the monitoring data of all checkpoints are acquired at the occurrence time point t of problem 1 and within the set time periods (t - Δt) and (t - 2Δt) before t, and a checkpoint is recorded as abnormal based on whether its monitoring data exceeds its anomaly threshold. Suppose that within the set time period (t - Δt) the checkpoints whose monitoring data exceeded their thresholds are checkpoint A, checkpoint B and checkpoint C, with occurrence probabilities of 65.4%, 25.2% and 9.4% respectively. The weight of a checkpoint is the ratio of its occurrence probability to the sum of the occurrence probabilities of all the checkpoints, so within (t - Δt) the weight of checkpoint A is 65.4%, that of checkpoint B is 25.2% and that of checkpoint C is 9.4%. With the weight threshold of 10%, the checkpoints recorded as abnormal with weights above the threshold within (t - Δt) are checkpoint A with weight 65.4% and checkpoint B with weight 25.2%. Suppose that within the set time period (t - 2Δt) the checkpoints whose monitoring data exceeded their thresholds are checkpoint A, checkpoint B and checkpoint D, with occurrence probabilities of 50.5%, 1.4% and 48.1% respectively, giving weights of 50.5%, 1.4% and 48.1%. With the weight threshold of 10%, the checkpoints recorded as abnormal with weights above the threshold within (t - 2Δt) are checkpoint A with weight 50.5% and checkpoint D with weight 48.1%.
Further, the step S163 includes: when the problem to be updated occurs in each set time period, updating the occurrence probability of each check point when the problem to be updated occurs based on the check point of the abnormality recorded in the current set time period and the check point of the abnormality recorded in history; specifically, the step S163 includes: when the problem to be updated occurs in each set time period, determining the current weight of the check point in the set time period based on the occurrence probability of the abnormal check point recorded in the set time period; updating the probability of occurrence of each of the checkpoints when the problem to be updated occurs based on the current weight of the checkpoint and the historical weight of the checkpoint for historical recorded anomalies.
Following the above embodiment, suppose that within the current set time period (t - Δt) before problem 1 to be updated occurred at time t, the checkpoints whose weights exceeded the weight threshold are checkpoint A with a current weight of 65.4% and checkpoint B with a current weight of 25.2% (the current weight of a checkpoint being determined from its occurrence probability), while within the historical set time period (t - 2Δt) the recorded checkpoints whose weights exceeded the weight threshold were checkpoint A with a historical weight of 50.5% and checkpoint D with a historical weight of 48.1%. The occurrence probability of each checkpoint when problem 1 occurs is then updated from the current and historical weights, i.e. the combined weight of each checkpoint is updated, where the combined weight is the average of the current and historical weights. The combined weight of checkpoint A for problem 1 is therefore (65.4% + 50.5%) / 2 = 57.95%, the combined weight of checkpoint B is (25.2% + 1.4%) / 2 = 13.3%, and the combined weight of checkpoint D is (0 + 48.1%) / 2 = 24.05%. The updated occurrence probability of checkpoint A is thus 57.95%, that of checkpoint B is 13.3%, and that of checkpoint D is 24.05%.
Following the above embodiment, the set probability in step S164 is the same as the weight threshold of the checkpoints, i.e. 10%. Since the updated occurrence probabilities of checkpoint A (57.95%), checkpoint B (13.3%) and checkpoint D (24.05%) for problem 1 are all higher than the set probability of 10%, checkpoint C is dropped from the check rule of problem 1 in the problem rule base, checkpoint A, checkpoint B and checkpoint D together with their corresponding anomaly thresholds are added to the check rule of the problem to be updated, and the check rule of the problem to be updated is updated based on checkpoint A, checkpoint B and checkpoint D, whose updated occurrence probabilities are higher than the set probability, and their related information.
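The update in steps S163 and S164 can be sketched as averaging each checkpoint's current and historical weights and keeping only the checkpoints whose updated occurrence probability exceeds the set probability (10% here, matching the weight threshold). The simple average and the names below follow the worked example and are illustrative only.

```python
# Combine current and historical weights by simple average (as in the example)
# and keep only checkpoints whose updated occurrence probability exceeds the
# set probability; those checkpoints and their thresholds form the new rule.
current_weights = {"A": 0.654, "B": 0.252}                  # period (t - dt)
historical_weights = {"A": 0.505, "B": 0.014, "D": 0.481}   # period (t - 2*dt)
set_probability = 0.10
thresholds = {"A": "a1", "B": "b1", "C": "c1", "D": "d1"}   # known anomaly thresholds

checkpoints = set(current_weights) | set(historical_weights)
updated = {
    cp: (current_weights.get(cp, 0.0) + historical_weights.get(cp, 0.0)) / 2
    for cp in checkpoints
}
# updated ~= {'A': 0.5795, 'B': 0.133, 'D': 0.2405}

new_rule = {
    cp: {"threshold": thresholds[cp], "weight": w}
    for cp, w in updated.items()
    if w > set_probability
}
# Checkpoint C never appears and is dropped; A, B and D make up the updated rule.
print(new_rule)
```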
Further, the relevant information of the checkpoint comprises at least any one of: an anomaly threshold for monitoring data of the checkpoint, a weight for the checkpoint, wherein the weight for the checkpoint is determined based on a probability of occurrence of the checkpoint.
Following the above embodiment, the anomaly threshold a1 of checkpoint A and its weight of 57.95%, the anomaly threshold b1 of checkpoint B and its weight of 13.3%, and the anomaly threshold d1 of checkpoint D and its weight of 24.05%, together with the corresponding monitoring data, are written into the check rule of problem 1 to be updated.
As the distributed file system is continuously health-checked and health early-warning information is produced and handled in advance, the user accumulates one or more pieces of check result information from handling the system. When a problem actually occurs and several checkpoints were abnormal in advance, iterative computation then needs to be performed, based on the acquired check result information, over the abnormal monitoring data of each checkpoint within a fixed time period before the problem occurred, so as to find the check rule that best reflects the abnormality of that problem, as shown in fig. 4.
Fig. 4 is a schematic flowchart of step S16, updating the problem rule base, in a method for checking the health status of a cluster according to yet another embodiment of the present application. The method comprises the following steps: step S165, step S166, step S167, and step S168.
Step S165 obtains a problem to be updated, and obtains an occurrence time point of the problem to be updated from at least one piece of inspection result information; the step S166 acquires monitoring data of all the check points in a set time period before the occurrence time point, and determines and records the abnormal check points based on the monitoring data; the step S167 updates the probability of occurrence of each of the checkpoints when the problem to be updated occurs, based on the checkpoints of the anomalies recorded within the set time period and the checkpoints of the anomalies recorded in history; the step S168 updates the check rule of the problem to be updated based on the updated check point and the related information thereof, the probability of occurrence of which is higher than the set probability.
It should be noted that the examination result information is result information related to the health-warning information, which is obtained in the process of examining the distributed file system. The inspection result information includes at least any one of: the problem which is abnormal, the occurrence time point of the problem, the corresponding check point which is abnormal when the problem occurs and the abnormal threshold value thereof. Of course, other information about the examination results that may be present or later come into existence, such as applicable to the present application, should also be included in the scope of the present application, and is hereby incorporated by reference.
In the embodiment of the present application, when a problem in the problem rule base needs to be updated, step S165 first obtains a problem to be updated and obtains the occurrence time point of the problem to be updated from at least one piece of check result information. Next, step S166 acquires the monitoring data of all the checkpoints within a set time period before the occurrence time point, and determines and records the abnormal checkpoints based on the monitoring data of the checkpoints and their anomaly thresholds. Then, step S167 updates the occurrence probability of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded within the current set time period and the abnormal checkpoints recorded in history. Finally, step S168 updates the check rule of the problem to be updated based on the updated checkpoints whose occurrence probability is higher than the set probability and their related information. The check rule of the problem to be updated is thus updated from the acquired check result information, so that the problem rule base is kept up to date, reflects the abnormal checkpoints in the distributed file system more comprehensively and more accurately, supports monitoring the health condition of the plurality of checkpoints corresponding to a problem when it occurs, and improves the accuracy and real-time performance of predicting the health status of each checkpoint corresponding to a problem in the cluster.
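For purposes of illustration only, the flow of steps S165 to S168 may be sketched in Python as follows; the helper fetch_window_data, the dictionary layout and the threshold handling are assumptions introduced for this sketch and are not part of the described method.

```python
# Illustration only: a minimal sketch of steps S165-S168 under assumed data layouts.
# `fetch_window_data`, the dictionary keys and the threshold handling are hypothetical.
from collections import Counter

def update_rule_from_results(result_infos, history_counts, thresholds,
                             set_probability, fetch_window_data, window):
    """Refresh a problem's check rule from check result information (steps S165-S168)."""
    counts = Counter(history_counts)                    # abnormal checkpoints recorded in history
    for info in result_infos:                           # S165: each piece of check result info
        t = info["occurrence_time"]
        data = fetch_window_data(end=t, length=window)  # S166: monitoring data before t
        for checkpoint, value in data.items():
            if value > thresholds[checkpoint]:          # S166: record abnormal checkpoints
                counts[checkpoint] += 1
    total = sum(counts.values()) or 1
    probability = {cp: n / total for cp, n in counts.items()}   # S167: updated probabilities
    # S168: keep only checkpoints whose updated probability exceeds the set probability
    return {cp: {"anomaly_threshold": thresholds[cp], "weight": p}
            for cp, p in probability.items() if p > set_probability}
```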
Fig. 5 illustrates a device 1 for checking the health status of a cluster according to an aspect of the present application. The device comprises an information obtaining device 11, a rule obtaining device 12, a monitoring processing device 13 and an early warning feedback device 14.
The information obtaining device 11 obtains the relevant information of the cluster to be checked; the rule obtaining device 12 obtains at least one problem to be checked and the corresponding check rule; the monitoring processing device 13 acquires, based on the relevant information of the cluster, the monitoring data of the checkpoints related to the check rule from the cluster, and aggregates the monitoring data to obtain a processing result; the early warning feedback device 14 retrieves the corresponding problem based on the processing result, and generates and feeds back health early warning information based on the relevant information of the problem.
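For purposes of illustration only, the cooperation of the information obtaining device 11, the rule obtaining device 12, the monitoring processing device 13 and the early warning feedback device 14 may be sketched in Python as follows; collect_monitoring_data and notify are hypothetical callbacks standing in for the cluster's monitoring module and the feedback channel.

```python
# Illustration only: an assumed end-to-end flow among devices 11-14.
def check_cluster_health(request, rule_base, collect_monitoring_data, notify):
    # Device 11: relevant information of the cluster to be checked
    cluster_info = {"location": request["location"], "period": request["period"]}
    warnings = []
    # Device 12: problems to be checked and their check rules (assumed dict layout)
    for problem, rule in rule_base.items():
        # Device 13: monitoring data of the checkpoints related to the check rule
        data = collect_monitoring_data(cluster_info, list(rule["thresholds"]))
        # Aggregation: compare each checkpoint's data with its anomaly threshold
        abnormal = {cp: v for cp, v in data.items() if v > rule["thresholds"][cp]}
        if abnormal:
            # Device 14: retrieve the problem and build health early warning information
            warnings.append({"problem": problem, "abnormal_checkpoints": abnormal})
    if warnings:
        notify(warnings)
    return warnings
```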
Here, the device 1 includes, but is not limited to, a user equipment, or a device formed by integrating a user equipment and a network device through a network. The user equipment includes, but is not limited to, any mobile electronic product capable of human-computer interaction with a user through a touch panel, such as a smart phone or a PDA, and the mobile electronic product may employ any operating system, such as the Android operating system or the iOS operating system. The network device includes an electronic device capable of automatically performing numerical calculation and information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application specific integrated circuit (ASIC), a programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like. The network includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, a wireless ad hoc network, etc. Preferably, the device 1 may also be a script program running on a device formed by integrating the user equipment and a network device through a network. Of course, those skilled in the art should understand that the above device 1 is only an example, and other existing or future devices, if applicable to the present application, shall also be included in the scope of the present application and are hereby incorporated by reference.
The above devices operate continuously. Here, those skilled in the art should understand that "continuously" means that the above devices operate in real time or according to set or real-time adjusted operating mode requirements.
In the embodiment of the present application, the cluster to be checked in the information obtaining device 11 is located on one or more cluster nodes of a distributed file system, where a distributed file system is a file system in which the physical storage resources it manages are not necessarily directly attached to the local node, but are connected to the node through a computer network. The present application is described in detail below by taking a distributed file system as an example. Of course, explaining the specific embodiments with a distributed file system is only for illustrative purposes and is not limiting; the following embodiments may also be implemented in other distributed cluster systems.
Further, the checkpoint comprises at least any one of: hardware devices in the cluster, and local modules of software devices in the cluster.
It should be noted that the checkpoints in the monitoring processing device 13 include, but are not limited to, the hardware devices of a single server under each cluster node in the distributed file system and the local modules of the software devices in the distributed file system. The hardware devices of a server include, but are not limited to, a central processing unit, a memory, a hard disk, a chipset, an input/output bus, an input/output device, a power supply, a chassis, and the like; the local modules of a software device include, but are not limited to, a system setup program module, a fault diagnosis program module, a fault handling program module, and the like. Of course, other existing or future checkpoints, if applicable to the present application, are also intended to be encompassed within the scope of the present application and are hereby incorporated by reference.
Further, the information obtaining device 11 obtains relevant information of the cluster to be checked based on a request submitted by a user, where the relevant information includes: cluster location information and a check period.
In the embodiment of the application, when the health condition of the online distributed file system needs to be monitored, a request submitted by a user is obtained, the cluster position information of a cluster to be checked and the checking time period for monitoring the cluster to be checked are obtained based on the request submitted by the user, wherein the cluster position information and the checking time period both belong to the related information of the cluster to be checked.
For example, when the health condition of the online distributed file system needs to be monitored, a request submitted by a user for monitoring each check point in the cluster is obtained, cluster position information corresponding to the cluster to be inspected is obtained based on the request submitted by the user, and one or more inspection time periods corresponding to the obtained monitoring data of the plurality of check points are obtained, where the cluster position information may be an actual geographical position range where cluster nodes distributed in different areas are located, or may also be an actual geographical position range where cluster nodes in the same area are located.
Further, the rule obtaining device 12 obtains at least one question to be checked and a corresponding checking rule from a question rule base.
It should be noted that the problem rule base in the rule obtaining device 12 mainly includes already established problems and a plurality of corresponding check rules. The problems include, but are not limited to, memory leak, read-write long tail, data loss, system performance problems, system availability problems and quality of service problems; a check rule includes the checkpoints and the anomaly thresholds of their corresponding monitoring data. Of course, other existing or future problem rule bases, if applicable to the present application, are also intended to be encompassed within the scope of the present application and are hereby incorporated by reference.
For example, if a problem in the problem rule base is memory leak, the corresponding check rule includes: the checkpoint 'change rate of the service pressure of the last week' and its anomaly threshold, the checkpoint 'total amount of created files' and its anomaly threshold, and the checkpoint 'memory usage increase slope' and its anomaly threshold. If a problem in the problem rule base is read-write long tail, the corresponding check rule includes: the checkpoint 'read-write call frequency of the last week' and its anomaly threshold, the checkpoint 'retransmission rate of the network in the cluster' and its anomaly threshold, and the checkpoint 'score information of the disk health state in the cluster' and its anomaly threshold.
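For purposes of illustration only, one possible in-memory layout of such a problem rule base is sketched below; the key names and numeric thresholds simply mirror the examples used in this description and are not normative.

```python
# Illustration only: an assumed layout of the problem rule base described above.
PROBLEM_RULE_BASE = {
    "memory_leak": {
        "thresholds": {
            "service_pressure_change_rate_last_week": 0.14,  # anomaly threshold 14%
            "total_created_files": 30,
            "memory_usage_increase_slope": 0.20,             # anomaly threshold 20%
        },
    },
    "read_write_long_tail": {
        "thresholds": {
            "read_write_call_frequency_last_week": 0.30,     # anomaly threshold 30%
            "network_retransmission_rate": 0.10,             # anomaly threshold 10%
            "disk_health_score": 60,
        },
    },
}
```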
Further, the monitoring processing device 13 includes: a search unit (not shown) and a data acquisition unit (not shown), wherein the search unit (not shown) is configured to search the cluster based on the cluster position information and acquire a check point related to the check rule in the cluster; the data obtaining unit (not shown) is configured to obtain monitoring data related to the check point in the check time period from the monitoring modules of the cluster.
It should be noted that the monitoring module of the cluster is mainly responsible for acquiring, from the monitoring system in the cluster, the monitoring data of each checkpoint related to the hardware devices and the software devices. Of course, other existing or future monitoring modules of the cluster, if applicable to the present application, are also intended to be included within the scope of the present application and are hereby incorporated by reference.
In the above embodiment of the present application, if the cluster position information is the geographic position information of Shanghai, the searching unit (not shown) searches for the Shanghai cluster based on that information and acquires each checkpoint related to the check rule from the Shanghai cluster; the data obtaining unit (not shown) obtains, from the monitoring module of the Shanghai cluster, the monitoring data of each relevant checkpoint within the check time period, for example: the total amount of created files is 34, the memory usage increase slope is 48%, the change rate of the service pressure of the last week is 1%, the read-write call frequency of the last week is 75.6%, the retransmission rate of the network in the cluster is 5.3%, and the score information of the disk health state in the cluster is 100.
The monitoring processing device 13 includes: a data processing unit (not shown), configured to separately process the monitoring data of each checkpoint based on the check rule corresponding to the problem to be checked, so as to obtain at least one checkpoint with abnormal monitoring data and feed back a processing result.
In the above embodiment of the present application, whether a problem to be checked exists can be predicted by comparing the monitoring data of the relevant checkpoints according to the check rule corresponding to the problem to be checked in the data processing unit. To predict whether the online distributed file system has a memory leak, the monitoring data corresponding to the three checkpoints 'change rate of the service pressure of the last week', 'total amount of created files' and 'memory usage increase slope' are matched against the corresponding check rule to obtain a processing result for the prediction. To predict whether the online distributed file system has a read-write long tail, the monitoring data corresponding to the three checkpoints 'read-write call frequency of the last week', 'retransmission rate of the network in the cluster' and 'score information of the disk health state in the cluster' are matched against the corresponding check rule to obtain a processing result.
For example, in the check rule for the problem to be checked being memory leak, the anomaly threshold of the total amount of created files is 30; since the monitoring data of 34 exceeds this threshold, the checkpoint 'total amount of created files' is abnormal. The anomaly threshold of the memory usage increase slope is 20%; since the monitoring data of 48% exceeds this threshold, the checkpoint 'memory usage increase slope' is abnormal. The anomaly threshold of the change rate of the service pressure of the last week is 14%; since the monitoring data of 1% is less than this threshold, the change rate of the service pressure of the last week is normal. In the check rule for read-write long tail, the anomaly threshold of the read-write call frequency of the last week is 30%; since the monitoring data of 75.6% exceeds this threshold, the checkpoint 'read-write call frequency of the last week' is abnormal. The anomaly threshold of the retransmission rate of the network in the cluster is 10%; since the monitoring data of 5.3% is less than this threshold, the retransmission rate of the network in the cluster is normal. The anomaly threshold of the score information of the disk health state in the cluster is 60; since the monitoring data of 15 is less than this threshold, the score information of the disk health state in the cluster is normal. Therefore, the obtained processing result indicates that, in the check rule corresponding to memory leak, the checkpoints 'total amount of created files' and 'memory usage increase slope' are abnormal, and in the check rule corresponding to read-write long tail, the checkpoint 'read-write call frequency of the last week' is abnormal.
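For purposes of illustration only, the comparison performed by the data processing unit may be sketched as follows, reusing the example thresholds and monitoring values above; the function name process and the dictionary layout are assumptions.

```python
# Illustration only: threshold comparison by the data processing unit (assumed layout).
monitoring_data = {
    "total_created_files": 34,
    "memory_usage_increase_slope": 0.48,
    "service_pressure_change_rate_last_week": 0.01,
    "read_write_call_frequency_last_week": 0.756,
    "network_retransmission_rate": 0.053,
    "disk_health_score": 15,
}

def process(rule_base, data):
    """Return, per problem, the checkpoints whose monitoring data exceeds the anomaly threshold."""
    result = {}
    for problem, rule in rule_base.items():
        abnormal = {cp: data[cp] for cp, threshold in rule["thresholds"].items()
                    if cp in data and data[cp] > threshold}
        if abnormal:
            result[problem] = abnormal
    return result

# With the PROBLEM_RULE_BASE sketched earlier, process(PROBLEM_RULE_BASE, monitoring_data) yields
# {'memory_leak': {'total_created_files': 34, 'memory_usage_increase_slope': 0.48},
#  'read_write_long_tail': {'read_write_call_frequency_last_week': 0.756}}
```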
In the above embodiment of the present application, after the monitoring data of each checkpoint is processed in the monitoring processing device 13 based on the check rule corresponding to the problem to be checked, a corresponding processing result is obtained. Next, the early warning feedback device 14 retrieves the corresponding problem based on the processing result. Since the processing result indicates that the checkpoints 'total amount of created files' and 'memory usage increase slope' in the check rule corresponding to memory leak are abnormal, and that the checkpoint 'read-write call frequency of the last week' in the check rule corresponding to read-write long tail is abnormal, the retrieved problems to be checked are memory leak and read-write long tail; the early warning feedback device 14 then generates and feeds back health early warning information based on the relevant information of these problems.
Further, the relevant information of the problem includes at least any one of: the occurrence time of the problem, the monitoring data of each relevant checkpoint, and the checkpoints whose monitoring data is abnormal when the problem occurs.
Next, in the above embodiment, the health early warning information is generated based on the relevant information of the problem, and includes the problem and its occurrence time, as well as each checkpoint whose monitoring data is abnormal when the problem occurs together with that monitoring data.
For example, based on the processing result that the checkpoints 'total amount of created files' and 'memory usage increase slope' in the check rule corresponding to memory leak are abnormal, and that the checkpoint 'read-write call frequency of the last week' in the check rule corresponding to read-write long tail is abnormal, the retrieved problems to be checked are memory leak and read-write long tail. Health early warning information is then generated according to an early warning report template in the distributed file system based on the relevant information of the problems, for example: {memory leak: at t1, the total amount of created files is 34 (abnormal); at t2, the memory usage increase slope is 48% (abnormal)}; {read-write long tail: at t3, the read-write call frequency of the last week is 75.6% (abnormal)}. The health early warning information is fed back to the system maintenance personnel, who, based on the fed-back information, warn in advance about each checkpoint with a problem under the cluster and process the related health early warning information. This improves the real-time performance of multi-checkpoint monitoring of the online distributed file system, achieves the purpose of early warning on the checkpoints in the cluster and handling the health early warning information, and also improves the accuracy of predicting the health condition of each checkpoint corresponding to a problem in the cluster.
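For purposes of illustration only, the assembly of such a warning report from the problem's related information may be sketched as follows; the helper build_health_warning and the exact report layout are assumptions modelled on the example above.

```python
# Illustration only: assembling health early warning text from per-problem anomalies.
def build_health_warning(problems):
    """`problems` maps a problem name to a list of (time, checkpoint, value) anomalies."""
    parts = []
    for problem, anomalies in problems.items():
        items = "; ".join(f"at {t}, {cp} = {value} (abnormal)" for t, cp, value in anomalies)
        parts.append(f"{{{problem}: {items}}}")
    return " ".join(parts)

print(build_health_warning({
    "memory leak": [("t1", "total amount of created files", 34),
                    ("t2", "memory usage increase slope", "48%")],
    "read-write long tail": [("t3", "read-write call frequency of the last week", "75.6%")],
}))
```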
In the early warning feedback device 14, if the monitoring data of each checkpoint in all the processing results does not exceed its anomaly threshold, health state information is generated, so that the maintenance personnel of the distributed file system know that the whole distributed file system is in a healthy state and no health early warning processing is needed.
In the embodiment of the present application, in the process of performing aggregation calculation on the monitoring data of the checkpoints of each cluster in the distributed file system by using the problem rule base, the problem rule base needs to be created and continuously updated, as shown in Fig. 6.
Fig. 6 is a schematic diagram of the devices for creating a problem rule base in a device for checking the health status of a cluster according to another aspect of the present application. The device 1 further comprises a creating rule device 15 and a rule updating device 16.
The creating rule device 15 creates a problem rule base, which comprises at least one problem and the corresponding check rule; the rule updating device 16 updates the problems in the problem rule base and the corresponding check rules.
In an embodiment of the present application, before aggregation calculation is performed on the monitoring data of each checkpoint in the distributed file system, the problem rule base needs to be created. The problem rule base includes at least one problem and the check rule corresponding to each problem, and a check rule includes at least one checkpoint and the anomaly threshold of each checkpoint; that is, when a problem occurs, multiple checkpoints become abnormal, and the problem corresponding to those abnormal checkpoints is predicted based on the multiple checkpoints at which the abnormality occurs.
For example, the problem rule base includes problem 1, problem 2 and problem 3, where the check rule corresponding to problem 1 is {problem 1: the anomaly threshold of checkpoint A is A1, the anomaly threshold of checkpoint B is B1, and the anomaly threshold of checkpoint C is C1}; the check rule corresponding to problem 2 is {problem 2: the anomaly threshold of checkpoint D is D1, the anomaly threshold of checkpoint E is E1, and the anomaly threshold of checkpoint F is F1}; the check rule corresponding to problem 3 is {problem 3: the anomaly threshold of checkpoint G is G1 and the anomaly threshold of checkpoint H is H1}.
With the sharp increase of users' mass data, the scale of the distributed file system keeps growing. In the process of predicting the health condition of the ever-growing online distributed file system, when a plurality of checkpoints actually become abnormal in advance before a problem occurs, iterative computation needs to be performed on the abnormal monitoring data of each checkpoint within a fixed time period before the problem occurs, so as to find the check rule that best reflects the abnormality of the problem, as shown in Fig. 7.
Fig. 7 shows a schematic structural diagram of the rule updating apparatus 16 in the device for checking the health status of the cluster according to an embodiment of the present application. The rule updating device 16 includes: a first information acquisition unit 161, a first recording unit 162, a first probability updating unit 163, and a first rule updating unit 164.
The first information obtaining unit 161 obtains the relevant information of the cluster to be checked, the problem to be updated and its initial monitoring threshold; the first recording unit 162 obtains, based on the initial monitoring threshold, the occurrence time point of the problem to be updated and the monitoring data of all the checkpoints within a set time period before the occurrence time point from the relevant information of the cluster, and determines and records the abnormal checkpoints based on the monitoring data; each time the problem to be updated occurs within a set time period, the first probability updating unit 163 updates the occurrence probability of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints recorded in history; the first rule updating unit 164 updates the check rule of the problem to be updated based on the updated checkpoints whose occurrence probability is higher than the set probability and their related information.
In the embodiment of the present application, when a problem in the problem rule base needs to be updated, the first information obtaining unit 161 first obtains the cluster position information and the check time period of the cluster to be checked, the problem to be updated to be trained, and the initial monitoring threshold corresponding to the problem to be updated. Then, the first recording unit 162 obtains, based on the initial monitoring threshold, the occurrence time point of the problem to be updated within the check time period and the monitoring data of all the checkpoints within a set time period before the occurrence time point from the cluster corresponding to the cluster position information, and records the checkpoints whose monitoring data is abnormal. Next, each time the problem to be updated occurs within a set time period, the first probability updating unit 163 updates the occurrence probability of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints recorded in history. Finally, the first rule updating unit 164 updates the check rule of the problem to be updated based on the updated checkpoints whose occurrence probability is higher than the set probability and their related information. The problem rule base is thus updated by updating the check rule of the problem to be updated, so that it reflects the abnormal checkpoints in the distributed file system more comprehensively and accurately, the health condition of the plurality of checkpoints corresponding to a problem is monitored, and the accuracy and real-time performance of predicting the health status of each checkpoint corresponding to a problem in the cluster are improved.
Further, the initial monitoring threshold includes: an anomaly threshold for monitoring data for all of the checkpoints and a weight threshold for the checkpoint at which an anomaly occurred; the first recording unit 162 is configured to: based on the abnormal threshold of the monitoring data of all the check points, acquiring the occurrence time point of the problem to be updated and the monitoring data of all the check points in a set time period before the occurrence time point from the related information of the cluster, and recording the corresponding check point when the weight of the abnormal check point exceeds the weight threshold, wherein the weight of the check point is determined based on the occurrence probability of the abnormal check point.
It should be noted that, when the problem to be updated occurs, the occurrence probability of a checkpoint and the weight of a checkpoint are calculated as follows. If the problem 1 to be updated occurs 1000 times within the set time period, and checkpoint A is abnormal 654 times, checkpoint B 252 times and checkpoint C 94 times, then the occurrence probability of checkpoint A is 65.4%, that of checkpoint B is 25.2% and that of checkpoint C is 9.4%. The weight of checkpoint A is 65.4%/(65.4% + 25.2% + 9.4%) = 65.4%, the weight of checkpoint B is 25.2%/(65.4% + 25.2% + 9.4%) = 25.2%, and the weight of checkpoint C is 9.4%/(65.4% + 25.2% + 9.4%) = 9.4%. Of course, other existing or future methods for calculating the occurrence probability of a checkpoint and the weight of a checkpoint, if applicable to the present application, are also included in the scope of the present application and are hereby incorporated by reference.
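For purposes of illustration only, the probability and weight arithmetic of this example may be reproduced as follows.

```python
# Illustration only: reproducing the occurrence probability and weight calculation above.
counts = {"A": 654, "B": 252, "C": 94}   # abnormal occurrences while problem 1 occurred 1000 times
occurrences = 1000

probability = {cp: n / occurrences for cp, n in counts.items()}
total = sum(probability.values())
weight = {cp: p / total for cp, p in probability.items()}

print(probability)  # {'A': 0.654, 'B': 0.252, 'C': 0.094}
print(weight)       # the probabilities already sum to 1.0 here, so each weight equals its probability
```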
Preferably, the first recording unit 162 includes: a judging subunit (not shown) and a recording subunit (not shown), wherein the judging subunit (not shown) is configured to judge whether the monitoring data of a checkpoint exceeds its anomaly threshold, and the recording subunit (not shown) is configured to determine and record the corresponding abnormal checkpoint if the threshold is exceeded.
For example, the check rule corresponding to the problem 1 to be updated in the problem rule base is obtained as {problem 1: the anomaly threshold of checkpoint A is A1, the anomaly threshold of checkpoint B is B1, the anomaly threshold of checkpoint C is C1}, and the weight threshold of the checkpoints is 10%. Based on the cluster position information and the check time period of the cluster to be checked, the occurrence time point t of the problem 1 to be updated and the monitoring data of all checkpoints in the set time periods (t + Δt) and (t + 2Δt) before the occurrence time point t are acquired, and the corresponding abnormal checkpoints are recorded based on whether the monitoring data of a checkpoint exceeds its anomaly threshold. In the set time period (t + Δt) before the occurrence time point t, the checkpoints whose monitoring data exceed the corresponding anomaly thresholds are checkpoint A, checkpoint B and checkpoint C, where the occurrence probability of checkpoint A in this period is 65.4%, that of checkpoint B is 25.2% and that of checkpoint C is 9.4%. Weights are calculated based on these occurrence probabilities, the weight of a checkpoint being the ratio of its occurrence probability to the sum of the occurrence probabilities of all checkpoints; in the set time period (t + Δt), the weight of checkpoint A is 65.4%, the weight of checkpoint B is 25.2% and the weight of checkpoint C is 9.4%. Since the weight threshold of the checkpoints is 10%, the checkpoints recorded as abnormal with weights exceeding the weight threshold in the set time period (t + Δt) are checkpoint A with a weight of 65.4% and checkpoint B with a weight of 25.2%. In the set time period (t + 2Δt) before the occurrence time point t, the checkpoints whose monitoring data exceed the corresponding anomaly thresholds are checkpoint A, checkpoint B and checkpoint D, where the occurrence probability of checkpoint A in this period is 50.5%, that of checkpoint B is 1.4% and that of checkpoint D is 48.1%. Weights are calculated based on these occurrence probabilities: the weight of checkpoint A is 50.5%, the weight of checkpoint B is 1.4% and the weight of checkpoint D is 48.1%. Since the weight threshold of the checkpoints is 10%, the checkpoints recorded as abnormal with weights exceeding the weight threshold in the set time period (t + 2Δt) are checkpoint A with a weight of 50.5% and checkpoint D with a weight of 48.1%.
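For purposes of illustration only, the per-window filtering by the 10% weight threshold applied by the first recording unit 162 in this example may be sketched as follows; filter_by_weight is a hypothetical helper.

```python
# Illustration only: keep the checkpoints whose weight exceeds the 10% weight threshold per window.
WEIGHT_THRESHOLD = 0.10

def filter_by_weight(occurrence_probability):
    total = sum(occurrence_probability.values())
    weights = {cp: p / total for cp, p in occurrence_probability.items()}
    return {cp: w for cp, w in weights.items() if w > WEIGHT_THRESHOLD}

window_1 = {"A": 0.654, "B": 0.252, "C": 0.094}   # set time period (t + Δt)
window_2 = {"A": 0.505, "B": 0.014, "D": 0.481}   # set time period (t + 2Δt)

print(filter_by_weight(window_1))   # A and B are recorded; C is below the threshold
print(filter_by_weight(window_2))   # A and D are recorded; B is below the threshold
```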
Further, the first probability updating unit 163 includes: a weight determination subunit (not shown) and a probability updating subunit (not shown), wherein the weight determination subunit (not shown) is configured to determine, when the problem to be updated occurs within each of the set time periods, a current weight of the checkpoint within the current set time period based on an occurrence probability of the checkpoint in which an abnormality is recorded for the current set time period; the probability updating subunit (not shown) is used for updating the occurrence probability of each checkpoint when the problem to be updated occurs based on the current weight of the checkpoint and the historical weight of the historical abnormal checkpoint.
Following the above embodiment of the present application, for the problem 1 to be updated, within the current set time period (t + Δt) before the abnormality occurs at time t, the checkpoints whose weights exceed the weight threshold are checkpoint A with a current weight of 65.4% and checkpoint B with a current weight of 25.2%, where the current weight of a checkpoint is determined based on its occurrence probability; within the historical set time period (t + 2Δt) before the abnormality occurs at time t, the recorded checkpoints whose weights exceed the weight threshold are checkpoint A with a historical weight of 50.5% and checkpoint D with a historical weight of 48.1%. The occurrence probability of each checkpoint when the problem 1 to be updated occurs is then updated based on the current weight and the historical weight of each checkpoint, that is, the integrated weight of each checkpoint when the problem to be updated occurs is updated, where the integrated weight of a checkpoint is the average of its current weight and its historical weight. The integrated weight of checkpoint A corresponding to the problem to be updated is therefore (65.4% + 50.5%)/2 = 57.95%, the integrated weight of checkpoint B is (25.2% + 1.4%)/2 = 13.3%, and the integrated weight of checkpoint D is (0 + 48.1%)/2 = 24.05%. The occurrence probability of each checkpoint when the problem 1 to be updated occurs is updated accordingly based on the current weight of the checkpoint and the weight of the checkpoint recorded in history, that is, the updated occurrence probability of checkpoint A is 57.95%, the updated occurrence probability of checkpoint B is 13.3%, and the updated occurrence probability of checkpoint D is 24.05%.
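For purposes of illustration only, the integrated-weight averaging above may be reproduced as follows; following the worked numbers, the raw per-window weights are averaged and a checkpoint absent from a window contributes 0 for that window.

```python
# Illustration only: integrated weight = average of the weights in the current and historical windows.
current = {"A": 0.654, "B": 0.252, "C": 0.094}     # weights in the set time period (t + Δt)
historical = {"A": 0.505, "B": 0.014, "D": 0.481}  # weights in the set time period (t + 2Δt)

checkpoints = set(current) | set(historical)
updated_probability = {cp: round((current.get(cp, 0.0) + historical.get(cp, 0.0)) / 2, 4)
                       for cp in checkpoints}

print(updated_probability)  # e.g. {'A': 0.5795, 'B': 0.133, 'C': 0.047, 'D': 0.2405} (key order may vary)
```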
Following the above embodiment of the present application, the set probability in the first rule updating unit 164 is consistent with the weight threshold of the checkpoints, that is, the set probability is 10%. For the problem 1 to be updated, the updated occurrence probability of checkpoint A, 57.95%, is higher than the set probability of 10%, the updated occurrence probability of checkpoint B, 13.3%, is higher than 10%, and the updated occurrence probability of checkpoint D, 24.05%, is higher than 10%. Therefore checkpoint C is discarded from the check rule of the problem 1 to be updated in the problem rule base, checkpoint A, checkpoint B and checkpoint D together with their corresponding anomaly thresholds are added to the check rule of the problem to be updated in the problem rule base, and the check rule of the problem to be updated is updated based on checkpoint A, checkpoint B and checkpoint D, whose updated occurrence probabilities are higher than the set probability, and their related information.
Further, the relevant information of the checkpoint comprises at least any one of: an anomaly threshold for monitoring data of the checkpoint, a weight for the checkpoint, wherein the weight for the checkpoint is determined based on a probability of occurrence of the checkpoint.
Next, in the above embodiment of the present application, the anomaly threshold A1 of the monitoring data of checkpoint A together with its weight of 57.95%, the anomaly threshold B1 of the monitoring data of checkpoint B together with its weight of 13.3%, and the anomaly threshold D1 of the monitoring data of checkpoint D together with its weight of 24.05% are stored as the updated check rule of the problem 1 to be updated.
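For purposes of illustration only, the rewriting of the check rule from the updated occurrence probabilities, as performed by the first rule updating unit 164, may be sketched as follows; the symbolic thresholds A1 to D1 and the dictionary layout are placeholders.

```python
# Illustration only: keep checkpoints whose updated occurrence probability exceeds the set probability.
SET_PROBABILITY = 0.10

updated_probability = {"A": 0.5795, "B": 0.133, "C": 0.047, "D": 0.2405}
anomaly_thresholds = {"A": "A1", "B": "B1", "C": "C1", "D": "D1"}   # symbolic placeholders

new_rule = {cp: {"anomaly_threshold": anomaly_thresholds[cp], "weight": p}
            for cp, p in updated_probability.items() if p > SET_PROBABILITY}

# C is discarded (0.047 <= 0.10); A, B and D with their thresholds become the updated check rule.
print(new_rule)
```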
As the distributed file system is continuously health-checked, health early warning information is acquired and processed in advance. In this process, a user may obtain more than one piece of check result information after handling the distributed file system. When a problem actually occurs and a plurality of checkpoints become abnormal beforehand, iterative computation needs to be performed, based on the acquired check result information, on the abnormal monitoring data of each checkpoint within a fixed time period before the problem occurs, so as to find the check rule that best reflects the abnormal condition of the problem, as shown in Fig. 8.
Fig. 8 is a schematic structural diagram illustrating a rule updating apparatus 16 in a device for checking health status of a cluster according to yet another embodiment of the present application. The rule updating device 16 includes: a second information acquisition unit 165, a second recording unit 166, a second probability updating unit 167, and a second rule updating unit 168.
Wherein the second information obtaining unit 165 obtains a problem to be updated and obtains the occurrence time point of the problem to be updated from at least one piece of check result information; the second recording unit 166 acquires the monitoring data of all the checkpoints within a set time period before the occurrence time point, and determines and records the abnormal checkpoints based on the monitoring data; the second probability updating unit 167 updates the occurrence probability of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded within the current set time period and the abnormal checkpoints recorded in history; the second rule updating unit 168 updates the check rule of the problem to be updated based on the updated checkpoints whose occurrence probability is higher than a set probability and their related information.
It should be noted that the check result information includes at least any one of: the problem in which an abnormality occurred, the occurrence time point of the problem, and the corresponding abnormal checkpoints when the problem occurred together with their anomaly thresholds. Of course, other check result information that exists now or may come into existence later, if applicable to the present application, should also be included in the scope of the present application and is hereby incorporated by reference.
In the embodiment of the present application, when a problem in the problem rule base needs to be updated, the second information obtaining unit 165 first obtains a problem to be updated and obtains the occurrence time point of the problem to be updated from at least one piece of check result information. Then, the second recording unit 166 acquires the monitoring data of all the checkpoints within a set time period before the occurrence time point, and determines and records the abnormal checkpoints based on the monitoring data of the checkpoints and their anomaly thresholds. Next, the second probability updating unit 167 updates the occurrence probability of each checkpoint when the problem to be updated occurs, based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints recorded in history. Finally, the second rule updating unit 168 updates the check rule of the problem to be updated based on the updated checkpoints whose occurrence probability is higher than the set probability and their related information. The check rule of the problem to be updated is thus updated from the acquired check result information to update the problem rule base, so that the problem rule base reflects the abnormal checkpoints in the distributed file system more comprehensively and accurately, the health condition of the plurality of checkpoints corresponding to a problem is monitored when the problem occurs, and the accuracy and real-time performance of predicting the health status of each checkpoint corresponding to a problem in the cluster are improved.
In addition, the present application also provides an apparatus for checking health status of a cluster, including:
a processor;
and a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring related information of a cluster to be checked;
acquiring at least one problem to be checked and a check rule corresponding to the problem;
acquiring monitoring data of a check point related to the check rule from the cluster based on the related information of the cluster, and performing aggregation processing on the monitoring data to obtain a processing result;
and calling the corresponding problem based on the processing result, and generating and feeding back health early warning information based on the relevant information of the problem.
Compared with the prior art, the method and device for checking the health status of a cluster provided by the embodiments of the present application obtain the relevant information of the cluster to be checked; obtain at least one problem to be checked and the corresponding check rule; acquire, based on the relevant information of the cluster, the monitoring data of the checkpoints related to the check rule from the cluster, and aggregate the monitoring data to obtain a processing result; and retrieve the corresponding problem based on the processing result, and generate and feed back health early warning information based on the relevant information of the problem. Before the health condition of the online distributed file system is predicted, as many problems to be checked as possible that may become abnormal online are formalized into corresponding check rules. When the health condition of the online distributed file system is predicted, the monitoring data corresponding to each checkpoint can therefore be obtained directly, and the monitoring data of the checkpoints are aggregated using the check rules to obtain a processing result, which improves the accuracy of monitoring the health condition of the plurality of checkpoints under each cluster node. The corresponding problem is retrieved based on the processing result, and health early warning information is generated and fed back based on the relevant information of the problem, so that maintenance personnel can, based on the fed-back health early warning information, warn in advance about each checkpoint found to have a problem under each cluster node and handle the related health early warning information. This improves the real-time performance of multi-checkpoint monitoring of the online distributed file system and achieves the purpose of multi-point alarming in advance. Further, aggregating the monitoring data to obtain a processing result includes separately processing the monitoring data of each checkpoint based on the check rule corresponding to the problem to be checked, so as to obtain at least one checkpoint with abnormal monitoring data and feed back a processing result, thereby monitoring the health condition of the plurality of checkpoints corresponding to a problem and improving the accuracy of predicting the health condition of each checkpoint corresponding to the problem in the cluster.
Further, the method and device for checking the health status of a cluster provided by the embodiments of the present application also create a problem rule base, which includes at least one problem and the corresponding check rule, and update the problems in the problem rule base and the corresponding check rules. This ensures that check rules are created for as many checkpoints as possible at which problems may occur in the online distributed file system. By updating the problems in the problem rule base and the corresponding check rules based on the monitoring data of all the checkpoints, the created problem rule base can reflect the abnormal checkpoints in the distributed file system more comprehensively and more accurately, the health condition of the plurality of checkpoints corresponding to a problem is monitored when the problem occurs, and the accuracy and real-time performance of predicting the health condition of each checkpoint corresponding to a problem in the cluster are improved.
Further, updating the problems in the problem rule base and their corresponding check rules includes: acquiring the relevant information of the cluster to be checked, a problem to be updated and its initial monitoring threshold; acquiring, based on the initial monitoring threshold, the occurrence time point of the problem to be updated and the monitoring data of all the checkpoints within a set time period before the occurrence time point from the relevant information of the cluster, and determining and recording the abnormal checkpoints based on the monitoring data; each time the problem to be updated occurs within a set time period, updating the occurrence probability of each checkpoint when the problem to be updated occurs based on the abnormal checkpoints recorded in the current set time period and the abnormal checkpoints recorded in history; and updating the check rule of the problem to be updated based on the updated checkpoints whose occurrence probability is higher than the set probability and their related information. In this way, the problem rule base is updated by updating the check rule of the problem to be updated, so that the problem rule base can reflect the abnormal checkpoints in the distributed file system more comprehensively and more accurately, the health condition of the checkpoints corresponding to a problem is monitored when the problem occurs, and the accuracy and real-time performance of predicting the health condition of each checkpoint corresponding to a problem in the cluster are improved.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Claims (27)
1. A method for checking cluster health status, wherein the method comprises:
acquiring related information of a cluster to be checked;
obtaining at least one problem to be checked and a corresponding check rule from a problem rule base;
acquiring monitoring data of a check point related to the check rule from the cluster based on the related information of the cluster, and performing aggregation processing on the monitoring data to obtain a processing result;
calling the corresponding problem based on the processing result, and generating and feeding back health early warning information based on the relevant information of the problem;
updating the problems in the problem rule base and the corresponding check rules, wherein the process comprises the following steps:
acquiring a problem to be updated;
determining an abnormal checkpoint based on the monitored data;
updating the occurrence probability of the abnormal check point when the problem to be updated occurs;
and updating the check rule of the problem to be updated based on the occurrence probability.
2. The method of claim 1, wherein the obtaining information about the cluster to be inspected comprises:
based on a request submitted by a user, obtaining relevant information of a cluster to be checked, wherein the relevant information comprises: cluster location information and a check period.
3. The method of claim 2, wherein said obtaining monitoring data of a checkpoint associated with the inspection rule from the cluster comprises:
searching the cluster based on the cluster position information, and acquiring a check point related to the check rule in the cluster;
and acquiring monitoring data related to the check point in the check time period from a monitoring module of the cluster.
4. The method of claim 1, wherein the aggregating the monitoring data to obtain a processing result comprises:
and respectively processing the monitoring data of each check point based on a check rule corresponding to the problem to be checked so as to obtain at least one check point with abnormal monitoring data and feed back a processing result.
5. The method of claim 1, wherein the information related to the question comprises at least any one of:
the occurrence time of the problem, the monitoring data of each relevant check point, and the check point when the monitoring data is abnormal when the problem occurs.
6. The method of claim 1, wherein the method further comprises:
creating a problem rule base, wherein the problem rule base comprises at least one problem and a corresponding check rule;
and updating the problems in the problem rule base and the corresponding check rules.
7. The method of claim 6, wherein the updating the questions in the question rule base and their corresponding check rules comprises:
acquiring relevant information of a cluster to be checked, a problem to be updated and an initial monitoring threshold value of the problem to be updated;
based on the initial monitoring threshold, acquiring the occurrence time point of the problem to be updated and the monitoring data of all the check points in a set time period before the occurrence time point from the related information of the cluster, and determining and recording the abnormal check points based on the monitoring data;
when the problem to be updated occurs in each set time period, updating the occurrence probability of each check point when the problem to be updated occurs based on the check point of the abnormality recorded in the current set time period and the check point of the abnormality recorded in history;
updating the check rule of the problem to be updated based on the updated check point with the occurrence probability higher than the set probability and the relevant information thereof.
8. The method of claim 7, wherein the initial monitoring threshold comprises: an anomaly threshold for monitoring data for all of the checkpoints and a weight threshold for the checkpoint at which an anomaly occurred;
the acquiring, based on the initial monitoring threshold, the occurrence time point of the problem to be updated and the monitoring data of all the check points in a set time period before the occurrence time point from the related information of the cluster, and determining and recording the abnormal check point based on the monitoring data includes:
based on the abnormal threshold of the monitoring data of all the check points, acquiring the occurrence time point of the problem to be updated and the monitoring data of all the check points in a set time period before the occurrence time point from the related information of the cluster, and recording the corresponding check point when the weight of the abnormal check point exceeds the weight threshold, wherein the weight of the check point is determined based on the occurrence probability of the abnormal check point.
9. The method of claim 7, wherein the determining and recording the checkpoint of anomalies based on the monitoring data comprises:
judging whether the monitoring data of the check point exceeds an abnormal threshold value;
and if the abnormal threshold is exceeded, determining and recording the corresponding abnormal check point.
10. The method of claim 6, wherein the updating the questions in the question rule base and their corresponding check rules comprises:
acquiring a problem to be updated, and acquiring the occurrence time point of the problem to be updated from at least one piece of inspection result information;
acquiring monitoring data of all the check points in a set time period before the occurrence time point, and determining and recording the abnormal check points based on the monitoring data;
updating the probability of occurrence of each check point when the problem to be updated occurs, based on the check points of the abnormality recorded in the current set time period and the check points of the abnormality recorded in history;
updating the check rule of the problem to be updated based on the updated check point with the occurrence probability higher than the set probability and the relevant information thereof.
11. The method according to any of claims 7 to 10, wherein the checkpoint related information comprises at least any of:
an anomaly threshold for monitoring data of the checkpoint, a weight for the checkpoint, wherein the weight for the checkpoint is determined based on a probability of occurrence of the checkpoint.
12. The method according to any one of claims 7 to 9, wherein the updating of the probability of occurrence of each checkpoint at the time of occurrence of the problem to be updated based on the checkpoint of the abnormality recorded in the current set period and the checkpoint of the abnormality recorded in history, when the problem to be updated occurs in each set period, comprises:
when the problem to be updated occurs in each set time period, determining the current weight of the check point in the set time period based on the occurrence probability of the abnormal check point recorded in the set time period;
updating the probability of occurrence of each of the checkpoints when the problem to be updated occurs based on the current weight of the checkpoint and the historical weight of the checkpoint for historical recorded anomalies.
13. The method of any of claims 1 to 10, wherein the check point comprises at least one of:
hardware devices in the cluster, and local modules of software devices in the cluster.
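Per claim 13, a check point may be a hardware device in the cluster or a local module of a software device. A brief illustration of how such a mixed inventory of check points might be identified; the field and example names are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum


class CheckPointKind(Enum):
    HARDWARE_DEVICE = "hardware_device"    # e.g. a disk, NIC or whole host in the cluster
    SOFTWARE_MODULE = "software_module"    # a local module of a software device


@dataclass(frozen=True)
class CheckPointId:
    kind: CheckPointKind
    host: str    # which machine in the cluster the check point belongs to
    name: str    # device name or module name


# Example inventory for one host of the cluster (names are illustrative only).
inventory = [
    CheckPointId(CheckPointKind.HARDWARE_DEVICE, "node-01", "disk:/dev/sda"),
    CheckPointId(CheckPointKind.HARDWARE_DEVICE, "node-01", "nic:eth0"),
    CheckPointId(CheckPointKind.SOFTWARE_MODULE, "node-01", "storage-service/replication"),
]
```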
14. An apparatus for checking health status of a cluster, wherein the apparatus comprises:
the information acquisition device is used for acquiring the related information of the cluster to be checked;
the rule obtaining device is used for obtaining at least one problem to be checked and a corresponding checking rule from the problem rule base;
the monitoring processing device is used for acquiring monitoring data of a check point related to the check rule from the cluster based on the related information of the cluster, and performing aggregation processing on the monitoring data to obtain a processing result;
the early warning feedback device is used for calling the corresponding problem based on the processing result and generating and feeding back health early warning information based on the relevant information of the problem;
a rule updating device for updating the problems in the problem rule base and the corresponding check rules, wherein the process comprises:
acquiring a problem to be updated;
determining an abnormal checkpoint based on the monitored data;
updating the occurrence probability of the abnormal check point when the problem to be updated occurs;
and updating the check rule of the problem to be updated based on the occurrence probability.
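Claim 14 arranges the check as four cooperating devices plus a rule updater. Read as a pipeline, one round of the health check could look like the sketch below; `monitoring.fetch`, `notifier.send_warning` and the dictionary keys are hypothetical stand-ins for the corresponding devices and are not interfaces defined by the patent.

```python
def run_health_check(request, rule_base, monitoring, notifier):
    """One pass of the health check arranged as in claim 14 (hypothetical interfaces).

    request    -- user request carrying cluster location information and a check period
    rule_base  -- problem rule base: problems to be checked mapped to their check rules
    monitoring -- access to the cluster's monitoring module
    notifier   -- channel for feeding back health early-warning information
    """
    # Information acquisition device: related information of the cluster to be checked.
    cluster_info = {"location": request["location"], "check_period": request["check_period"]}

    # Rule obtaining device: at least one problem to be checked and its check rule.
    for problem, check_rule in rule_base.items():
        # Monitoring processing device: acquire and aggregate the monitoring data of
        # the check points related to this check rule.
        data = {point: monitoring.fetch(cluster_info, point, cluster_info["check_period"])
                for point in check_rule["check_points"]}
        abnormal_points = [
            point for point, samples in data.items()
            if any(value > check_rule["thresholds"].get(point, float("inf")) for value in samples)
        ]

        # Early-warning feedback device: call the corresponding problem and feed back
        # warning information built from the problem's relevant information.
        if abnormal_points:
            notifier.send_warning(problem=problem,
                                  abnormal_check_points=abnormal_points,
                                  monitoring_data={p: data[p] for p in abnormal_points})
```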
15. The apparatus of claim 14, wherein the information acquisition device is configured to:
based on a request submitted by a user, obtaining relevant information of a cluster to be checked, wherein the relevant information comprises: cluster location information and a check period.
16. The apparatus of claim 15, wherein the monitoring processing means comprises:
the searching unit is used for searching the cluster based on the cluster position information and acquiring a check point related to the check rule in the cluster;
and the data acquisition unit is used for acquiring the monitoring data related to the check point in the check time period from the monitoring module of the cluster.
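Claims 15 and 16 have the monitoring processing device first locate the cluster from the cluster location information carried in the user's request and then pull the monitoring data of the rule's check points for the check time period from the cluster's monitoring module. A hedged sketch of the searching unit and the data acquisition unit; the `MonitoringModule` interface and its `query` method are assumptions for illustration:

```python
from typing import Dict, List, Protocol


class MonitoringModule(Protocol):
    """Assumed interface of the cluster's monitoring module."""
    def query(self, check_point: str, start: float, end: float) -> List[float]: ...


def find_cluster(clusters: Dict[str, dict], location_info: str) -> dict:
    """Searching unit: locate the cluster to be checked by its location information."""
    return clusters[location_info]


def acquire_monitoring_data(module: MonitoringModule,
                            check_points: List[str],
                            check_start: float,
                            check_end: float) -> Dict[str, List[float]]:
    """Data acquisition unit: monitoring data of the rule's check points
    within the check time period."""
    return {point: module.query(point, check_start, check_end) for point in check_points}
```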
17. The apparatus of claim 14, wherein the monitoring processing means comprises:
and the data processing unit is used for processing the monitoring data of each check point separately, based on the check rule corresponding to the problem to be checked, so as to obtain at least one check point with abnormal monitoring data and feed back a processing result.
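Claim 17's data processing unit evaluates each check point's monitoring data separately against the check rule of the problem being checked and feeds back a processing result naming the check points whose data is abnormal. The sketch below aggregates each point's samples by their maximum value before comparing against the threshold; that aggregation choice, like the names, is illustrative rather than mandated by the claim.

```python
from typing import Dict, List


def process_monitoring_data(samples_by_point: Dict[str, List[float]],
                            anomaly_threshold: Dict[str, float]) -> dict:
    """Evaluate each check point's monitoring data separately against its threshold."""
    abnormal: Dict[str, float] = {}
    for point, samples in samples_by_point.items():
        aggregated = max(samples) if samples else 0.0   # illustrative aggregation choice
        if aggregated > anomaly_threshold.get(point, float("inf")):
            abnormal[point] = aggregated
    # Processing result: at least one check point with abnormal monitoring data,
    # together with the aggregated values that triggered it.
    return {"abnormal_check_points": sorted(abnormal), "aggregated_values": abnormal}
```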
18. The apparatus of claim 14, wherein the relevant information of the problem comprises at least one of:
the occurrence time of the problem, the monitoring data of each relevant check point, and the check point whose monitoring data is abnormal when the problem occurs.
19. The apparatus of claim 14, wherein the apparatus further comprises:
a rule creating device for creating a problem rule base, wherein the problem rule base comprises at least one problem and a corresponding check rule;
and the rule updating device is used for updating the problems in the problem rule base and the corresponding check rules.
20. The apparatus of claim 19, wherein the rule updating means comprises:
the first information acquisition unit is used for acquiring relevant information of a cluster to be checked, a problem to be updated and an initial monitoring threshold value of the problem to be updated;
a first recording unit, configured to obtain, based on the initial monitoring threshold, the occurrence time point of the problem to be updated and monitoring data of all the check points in a set time period before the occurrence time point from related information of the cluster, and determine and record the abnormal check point based on the monitoring data;
a first probability updating unit configured to update, when the problem to be updated occurs within each of the set time periods, an occurrence probability of each of the checkpoints at the time of occurrence of the problem to be updated based on the checkpoints of the anomalies recorded within the current set time period and the checkpoints of the anomalies recorded in history;
and the first rule updating unit is used for updating the check rule of the problem to be updated based on the check points whose updated occurrence probability is higher than the set probability and the relevant information thereof.
21. The apparatus of claim 20, wherein the initial monitoring threshold comprises: an anomaly threshold for monitoring data for all of the checkpoints and a weight threshold for the checkpoint at which an anomaly occurred;
the first recording unit is configured to:
based on the abnormal threshold of the monitoring data of all the check points, acquiring the occurrence time point of the problem to be updated and the monitoring data of all the check points in a set time period before the occurrence time point from the related information of the cluster, and recording the corresponding check point when the weight of the abnormal check point exceeds the weight threshold, wherein the weight of the check point is determined based on the occurrence probability of the abnormal check point.
22. The apparatus of claim 20, wherein the first recording unit comprises:
the judging subunit is used for judging whether the monitoring data of the check point exceeds an abnormal threshold value;
and the recording subunit is used for determining and recording the corresponding check point as abnormal if the monitoring data exceeds the abnormal threshold value.
23. The apparatus of claim 19, wherein the rule updating means comprises:
the second information acquisition unit is used for acquiring the problems to be updated and acquiring the occurrence time points of the problems to be updated from at least one piece of inspection result information;
the second recording unit is used for acquiring the monitoring data of all the check points in a set time period before the occurrence time point, and determining and recording the abnormal check points based on the monitoring data;
a second probability updating unit, configured to update an occurrence probability of each check point when the problem to be updated occurs, based on the check point of the abnormality recorded in the current set time period and the check points of the abnormality recorded in history;
and the second rule updating unit is used for updating the check rule of the problem to be updated based on the check points whose updated occurrence probability is higher than the set probability and the relevant information thereof.
24. The apparatus according to any one of claims 20 to 23, wherein the relevant information of the check point comprises at least one of:
an anomaly threshold for the monitoring data of the check point and a weight of the check point, wherein the weight of the check point is determined based on the occurrence probability of the check point.
25. The apparatus of any of claims 20 to 22, wherein the first probability updating unit comprises:
a weight determination subunit, configured to determine, when the problem to be updated occurs within each set time period, a current weight of the checkpoint within the current set time period based on an occurrence probability of the abnormal checkpoint recorded in the current set time period;
and the probability updating subunit is used for updating the occurrence probability of each check point when the problem to be updated occurs, based on the current weight of the check point and the historical weight of the check point for the abnormalities recorded in history.
26. The apparatus according to any one of claims 14 to 23, wherein the check point comprises at least one of:
hardware devices in the cluster, and local modules of software devices in the cluster.
27. An apparatus for checking health status of a cluster, comprising:
a processor;
and a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring related information of a cluster to be checked;
obtaining at least one problem to be checked and a corresponding check rule from a problem rule base;
acquiring monitoring data of a check point related to the check rule from the cluster based on the related information of the cluster, and performing aggregation processing on the monitoring data to obtain a processing result;
calling the corresponding problem based on the processing result, and generating and feeding back health early warning information based on the relevant information of the problem;
updating the problems in the problem rule base and the corresponding check rules, wherein the process comprises the following steps:
acquiring a problem to be updated;
determining an abnormal checkpoint based on the monitored data;
updating the occurrence probability of the abnormal check point when the problem to be updated occurs;
and updating the check rule of the problem to be updated based on the occurrence probability.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610194499 | 2016-03-31 | | |
CN2016101944993 | 2016-03-31 | | |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107391335A CN107391335A (en) | 2017-11-24 |
CN107391335B (en) | 2021-09-03
Family
ID=60338371
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710205541.1A (Active; CN107391335B) | Method and equipment for checking health state of cluster | 2016-03-31 | 2017-03-31
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107391335B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108255676A (en) * | 2018-01-15 | 2018-07-06 | 南京市城市规划编制研究中心 | A kind of monitoring method of software systems client health degree |
CN108874640B (en) * | 2018-05-07 | 2022-09-30 | 北京京东尚科信息技术有限公司 | Cluster performance evaluation method and device |
CN109376043A (en) * | 2018-10-18 | 2019-02-22 | 郑州云海信息技术有限公司 | A method and device for monitoring equipment |
CN110069393A (en) * | 2019-03-11 | 2019-07-30 | 北京互金新融科技有限公司 | Detection method, device, storage medium and the processor of software environment |
CN110278133B (en) * | 2019-07-31 | 2021-08-13 | 中国工商银行股份有限公司 | Checking method, device, computing equipment and medium executed by server |
CN113645525B (en) * | 2021-08-09 | 2023-06-02 | 中国工商银行股份有限公司 | Method, device, equipment and storage medium for checking operation state of optical fiber switch |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101123521A (en) * | 2006-08-07 | 2008-02-13 | 华为技术有限公司 | A management method for check points in cluster |
CN102957563A (en) * | 2011-08-16 | 2013-03-06 | 中国石油化工股份有限公司 | Linux cluster fault automatic recovery method and Linux cluster fault automatic recovery system |
CN104917627A (en) * | 2015-01-20 | 2015-09-16 | 杭州安恒信息技术有限公司 | Log cluster scanning and analysis method used for large-scale server cluster |
CN104954181A (en) * | 2015-06-08 | 2015-09-30 | 北京集奥聚合网络技术有限公司 | Method for warning faults of distributed cluster devices |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7484132B2 (en) * | 2005-10-28 | 2009-01-27 | International Business Machines Corporation | Clustering process for software server failure prediction |
US8887006B2 (en) * | 2011-04-04 | 2014-11-11 | Microsoft Corporation | Proactive failure handling in database services |
Also Published As
Publication number | Publication date |
---|---|
CN107391335A (en) | 2017-11-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107391335B (en) | Method and equipment for checking health state of cluster | |
CN109239265B (en) | Fault detection method and device for monitoring equipment | |
JP6394726B2 (en) | Operation management apparatus, operation management method, and program | |
US10373065B2 (en) | Generating database cluster health alerts using machine learning | |
CN107301118B (en) | A method and system for automatic labeling of fault indicators based on logs | |
US20160378583A1 (en) | Management computer and method for evaluating performance threshold value | |
CN110286656B (en) | False alarm filtering method and device for tolerance of error data | |
US20140365829A1 (en) | Operation management apparatus, operation management method, and program | |
US20150269120A1 (en) | Model parameter calculation device, model parameter calculating method and non-transitory computer readable medium | |
US9524223B2 (en) | Performance metrics of a computer system | |
CN107025224B (en) | Method and equipment for monitoring task operation | |
US20230038164A1 (en) | Monitoring and alerting system backed by a machine learning engine | |
CN110008247B (en) | Method, device and equipment for determining abnormal source and computer readable storage medium | |
US11196613B2 (en) | Techniques for correlating service events in computer network diagnostics | |
CN111722952A (en) | Fault analysis method, system, equipment and storage medium of business system | |
CN102055604A (en) | Fault location method and system thereof | |
CA3051483C (en) | System and method for automated and intelligent quantitative risk assessment of infrastructure systems | |
CN107944721B (en) | Universal machine learning method, device and system based on data mining | |
CN117041029A (en) | Network equipment fault processing method and device, electronic equipment and storage medium | |
Atzmueller et al. | Anomaly detection and structural analysis in industrial production environments | |
CN117608974A (en) | Server fault detection method, device, equipment and media based on artificial intelligence | |
US12149402B2 (en) | Method and system for evaluating peer groups for comparative anomaly | |
CN108009063B (en) | Method for detecting fault threshold of electronic equipment | |
WO2013035266A1 (en) | Monitoring device, monitoring method and program | |
US9397921B2 (en) | Method and system for signal categorization for monitoring and detecting health changes in a database system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||