
CN117909111A - Method, device, equipment and storage medium for processing monitoring data - Google Patents


Info

Publication number
CN117909111A
Authority
CN
China
Prior art keywords
node
monitoring
application
blocking
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311747694.0A
Other languages
Chinese (zh)
Inventor
杨旋
王吉玲
张路
许雯莉
陈丽萍
王庆华
刘晓强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peoples Insurance Company of China
Original Assignee
Peoples Insurance Company of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peoples Insurance Company of China
Priority to CN202311747694.0A
Publication of CN117909111A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a method, a device, equipment and a storage medium for processing monitoring data, wherein the method comprises the following steps: receiving monitoring data sent by a monitoring tool, and judging whether an application node monitored by the monitoring tool is abnormal based on the monitoring data, the monitoring tool being used for monitoring the running state of the application node; when at least one application node is determined to be abnormal, performing recovery processing on the abnormal application node to obtain a processing result; judging whether a blocking node exists in the at least one application node based on the processing result; if yes, acquiring an abnormal log corresponding to the blocking node for data analysis, and generating an analysis result to remind a user to correct the blocking node based on the analysis result, the abnormal log including monitoring index data of at least one type of monitoring tool. In this way, after the system raises an alarm, the blocking node can be determined, and the monitoring indexes of the various types of monitoring tools corresponding to the blocking node are analyzed, which improves the completeness and accuracy of the analysis and further ensures the stability of the system.

Description

Method, device, equipment and storage medium for processing monitoring data
Technical Field
The present application relates to the field of monitoring automation and alarm of application systems, and in particular, to a method, apparatus, device and storage medium for processing monitoring data.
Background
With the continuous development of monitoring tools, operation and maintenance personnel can accurately and promptly grasp the operation state of a system, and alarms are raised for abnormal equipment and resources, which helps the operation and maintenance personnel plan maintenance early and ensures the safe and efficient operation of the equipment and the information system.
In the prior art, when an application system is monitored, different alarm thresholds can be set for different monitoring indexes in different monitoring tools, and when a monitoring index exceeds its corresponding alarm threshold, a corresponding alarm level is generated.
However, each monitoring tool can only play its own role in its respective field. When a performance problem affects the application system as a whole, the monitoring indexes of the different tools differ, so that analyzing those monitoring indexes is complex and the stability of the system is poor.
Disclosure of Invention
The application provides a monitoring data processing method, a device, equipment and a storage medium, which are used for solving the problems that each monitoring tool can only play its own role in its respective field, and that when a performance problem affects the application system as a whole, the monitoring indexes of the different tools differ, making the analysis of those monitoring indexes complex and the stability of the system poor.
In a first aspect, the present application provides a method of monitoring data processing, the method comprising:
receiving monitoring data sent by a first monitoring tool, and judging whether an application node monitored by the first monitoring tool is abnormal or not based on the monitoring data; the first monitoring tool is used for monitoring the running state of the application node;
When it is determined that the at least one application node monitored by the first monitoring tool is abnormal, recovering the at least one abnormal application node to obtain a processing result;
judging whether a blocking node exists in the at least one application node or not based on the processing result;
If yes, acquiring an abnormal log corresponding to the blocking node for data analysis, and generating an analysis result so as to remind a user to correct the blocking node based on the analysis result; the anomaly log includes monitoring indicator data for at least one type of monitoring tool.
Optionally, the method further comprises:
acquiring monitoring data of a second monitoring tool; the second monitoring tool is used for monitoring whether the application node is down;
When the fact that the application node monitored by the second monitoring tool is down is determined based on the monitoring data of the second monitoring tool, obtaining a down node;
judging whether the downtime node is consistent with at least one application node with abnormality;
If yes, a third monitoring tool is called, and the downtime node is restarted; the third monitoring tool is used for restarting the application node which is down.
Optionally, the method further comprises:
Searching position information of at least one application node with abnormality and the downtime node;
and graphically displaying the at least one abnormal application node and the downtime node based on the position information.
Optionally, the method further comprises:
And after determining that the downtime node is inconsistent with at least one abnormal application node, sending the at least one abnormal application node and the downtime node to a client for manual judgment so as to determine whether the at least one application node and the downtime node have normal application nodes.
Optionally, obtaining an exception log corresponding to the blocking node for data analysis, and generating an analysis result includes:
Acquiring a history blocking record corresponding to a blocking node in the abnormal log; the history blocking record comprises the recovery time and the blocking times of the blocking node;
And when the blocking times are greater than a first threshold value and/or the recovery time is greater than a second threshold value, generating an analysis result based on the abnormal log.
Optionally, the method further comprises:
And after the analysis result is generated, a third monitoring tool is called, and the blocking node is restarted.
Optionally, the method further comprises:
graphically displaying the monitoring index data of at least one type of monitoring tool; the monitoring index data includes: interface call success rate, interface call response time, the execution flow of each application node, memory occupation, and central processing unit (CPU) occupation.
In a second aspect, the present application provides a monitoring data processing apparatus, the apparatus comprising:
The receiving module is used for receiving the monitoring data sent by the first monitoring tool and judging whether the application node monitored by the first monitoring tool is abnormal or not based on the monitoring data; the first monitoring tool is used for monitoring the running state of the application node;
The recovery module is used for carrying out recovery processing on at least one abnormal application node after determining that the at least one abnormal application node monitored by the first monitoring tool is abnormal, so as to obtain a processing result;
the judging module is used for judging whether a blocking node exists in the at least one application node or not based on the processing result;
The analysis module is used for acquiring an abnormal log corresponding to the blocking node for data analysis after determining that the blocking node exists in the at least one application node, and generating an analysis result so as to remind a user to correct the blocking node based on the analysis result; the anomaly log includes monitoring indicator data for at least one type of monitoring tool.
In a third aspect, the present application provides an electronic device comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
The processor executes computer-executable instructions stored by the memory to implement the method of any one of the first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium storing computer-executable instructions for implementing the method of any one of the first aspects when executed by a processor.
In summary, the present application provides a method, an apparatus, a device, and a storage medium for processing monitoring data, where monitoring alarm is implemented by fusing monitoring index data of at least one type of monitoring tool, and specifically, whether an application node is abnormal is determined based on monitoring data sent by a monitoring tool that monitors an operation state of the application node; if yes, further judging whether a blocking node exists, after determining that the blocking node exists, carrying out data analysis on an abnormal log corresponding to the blocking node and comprising monitoring index data of at least one type of monitoring tool, and generating an analysis result to remind a user to correct; therefore, the application can automatically analyze and process the error log based on the alarm, and greatly ensure the stability and the robustness of the system.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a method for processing monitoring data according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a graphical display of monitoring index data according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of an alternative method for processing monitoring data according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a monitoring data processing apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Specific embodiments of the present application have been shown by way of the above drawings and will be described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but rather to illustrate the inventive concepts to those skilled in the art by reference to the specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
In order to clearly describe the technical solution of the embodiments of the present application, in the embodiments of the present application, the words "first", "second", etc. are used to distinguish between identical or similar items having substantially the same function and effect. For example, the first device and the second device are merely used to distinguish different devices, without limiting their order of precedence. It will be appreciated by those skilled in the art that the words "first", "second", and the like do not limit the quantity or the order of execution, and that the objects modified by "first", "second", and the like are not necessarily different.
In the present application, the words "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a alone, a and B together, and B alone, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
With the continuous development of monitoring tools, operation and maintenance personnel can accurately and promptly grasp the operation state of a system, and alarms are raised for abnormal equipment and resources, which helps the operation and maintenance personnel plan maintenance early and ensures the safe and efficient operation of the equipment and the information system.
In one possible implementation manner, when an application system is monitored, different alarm thresholds can be set for different monitoring indexes in different monitoring tools, and when a monitoring index exceeds its corresponding alarm threshold, a corresponding alarm level is generated.
For example, different monitoring tools monitor different monitoring indexes: host and application conditions can be monitored through the Zabbix monitoring tool; application performance conditions such as response rate, success rate, response time and transaction amount can be monitored through an application performance management (Application Performance Management, APM) monitoring tool based on network traffic bypass technology; and the system log can be analyzed based on the ELK (Elasticsearch, Logstash, Kibana) log analysis platform.
However, each monitoring tool can only play its own role in its respective field. When a performance problem affects the application system as a whole, the monitoring indexes of the different tools differ, so that analyzing those monitoring indexes is complex and the stability of the system is poor.
It should be noted that different monitoring tools emphasize different aspects of the monitoring data, so their monitoring indexes differ. As a result, the system cannot combine and display the monitoring indexes after receiving a monitoring alarm, which is inconvenient to use; moreover, the monitoring tools only monitor the application nodes, and few monitoring tools analyze and process the monitoring data of the application nodes.
In view of the above problems, the present application provides a method for processing monitoring data, which performs data analysis by fusing monitoring index data of at least one type of monitoring tool, and specifically, determines whether an application node is abnormal based on monitoring data sent by the monitoring tool for monitoring an operation state of the application node; if yes, further judging whether a blocking node exists, after determining that the blocking node exists, carrying out data analysis on an abnormal log corresponding to the blocking node and comprising monitoring index data of at least one type of monitoring tool, and generating an analysis result to remind a user to correct; therefore, the application can automatically analyze and process the error log based on the alarm, and greatly ensure the stability and the robustness of the system.
Embodiments of the present application are described below with reference to the accompanying drawings. Fig. 1 is a schematic diagram of an application scenario provided by an embodiment of the present application, and the method for processing monitoring data provided by the present application may be applied to the application scenario shown in fig. 1. The application scene comprises: a first monitoring tool 101, a second monitoring tool 102, a third monitoring tool 103, a monitoring application 104 and a user's terminal device 105.
The first monitoring tool 101 may be a Java application server used for developing, integrating, deploying and managing large-scale distributed Web applications, network applications and database applications, and is used here to monitor the running states of the monitored components. The second monitoring tool 102 may be a monitoring tool that periodically grabs the survival states of the monitored components through the HTTP protocol. The third monitoring tool 103 may be a product for monitoring hosts/containers and internet applications, which is used for collecting monitoring indexes of host/container resources (system performance, component services, databases, logs, etc.), detecting the availability of internet application services, and performing alarming and automatic execution processing on those indexes.
Illustratively, the first monitoring tool 101 may monitor an operation state of at least one monitored component, where the monitored component may refer to various types of servers, and further send the operation state of the monitored component to the monitoring application 104 to determine whether an abnormality occurs in the operation state of the monitored component, where the abnormal operation state includes: downtime status and blocking status.
When the monitoring application system 104 determines, based on the data from the first monitoring tool 101, that a monitored component is abnormal, recovery processing can be performed on the abnormal monitored component. If the running state of the monitored component can be recovered, the monitored component runs normally; if it cannot be recovered, the monitored component may be blocked and/or down.
Further, after determining that the monitored component is blocked, the monitoring application 104 may obtain an exception log corresponding to the monitored component, where the exception log may include monitoring index data received by the monitoring application 104 from multiple types of monitoring tools, such as monitoring indexes of host/container resources collected by the third monitoring tool 103.
It should be noted that, in the embodiment of the present application, the number and types of the monitoring tools that the monitoring application 104 receives the monitoring index data are not limited in particular, and the first monitoring tool 101, the second monitoring tool 102, and the third monitoring tool 103 are merely exemplary, and the monitoring application 104 may also receive the monitoring data of more types of monitoring tools for processing.
Optionally, the monitoring application system 104 may obtain the survival status of at least one monitored component monitored by the second monitoring tool 102, the survival status including a running status and a downtime status. After the monitoring application system 104 determines that the down monitored component reported by the second monitoring tool 102 is consistent with the down monitored component determined through the first monitoring tool 101, the third monitoring tool 103 may be invoked to restart the down monitored component.
The third monitoring tool 103 may also be used to restart the monitored component that is down.
Alternatively, the terminal device may be any of various electronic devices that have a display screen and support web browsing, and the terminal device may also be referred to as a terminal (Terminal), user equipment (User Equipment, UE), a mobile station (Mobile Station, MS), a mobile terminal (Mobile Terminal, MT), or the like. The terminal device may be a mobile phone, a smart television, a wearable device, a smart speaker, a smart security device, a smart gateway, a tablet computer (pad), a computer with a wireless transceiving function, a virtual reality (Virtual Reality, VR) terminal device, an augmented reality (Augmented Reality, AR) terminal device, a wireless terminal in industrial control, a wireless terminal in self-driving, a wireless terminal in remote medical surgery, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, etc. Such terminal devices include, but are not limited to, smartphones, tablet computers, laptop portable computers, desktop computers, and the like.
The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Based on the application scenario shown in fig. 1, the embodiment of the application provides a monitoring data processing method. Fig. 2 is a flow chart of a method for processing monitoring data according to an embodiment of the present application. The execution subject of the monitoring data processing method is a monitoring application system, as shown in fig. 2, and the method includes:
S201, receiving monitoring data sent by a first monitoring tool, and judging whether an application node monitored by the first monitoring tool is abnormal or not based on the monitoring data; the first monitoring tool is used for monitoring the running state of the application node.
In the embodiment of the application, the running states include a running state, a blocking state and a closed state, and the closed state may include a downtime state and a stopped state. The downtime state may refer to the phenomenon that a machine cannot recover from a system error, or that a problem occurs at the hardware layer of the system, so that the system does not respond for a long time and has to be restarted. The stopped state may refer to the phenomenon that the machine is powered off or the system is shut down normally, so that operation stops. The blocking state refers to a state in which the machine gives up the use right of the central processing unit (Central Processing Unit, CPU) for some reason and temporarily stops running.
It should be noted that, the embodiment of the present application does not limit a specific monitoring platform corresponding to the first monitoring tool, and may be used to monitor an operation state of an application node; an application node may refer to a monitored component, a monitored product, or a monitored cluster, etc., and embodiments of the present application are not limited in this regard.
The first monitoring tool can monitor the running states of a plurality of application nodes, or even of a single application node. The first monitoring tool then sends the monitored running states of the application nodes to the monitoring application system for processing, so as to judge whether any application node is abnormal, that is, whether a downtime node or a blocking node exists; the downtime node is an application node on which downtime occurs, and the blocking node is an application node on which blocking occurs.
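For readers who prefer code, the following minimal Python sketch (not part of the patent text) illustrates one possible shape of step S201; the class names, function names and state values are assumptions introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class RunState(Enum):
    RUNNING = "running"
    BLOCKED = "blocked"   # gave up the CPU and temporarily stopped running
    DOWN = "down"         # cannot recover and has to be restarted
    STOPPED = "stopped"   # powered off or shut down normally

@dataclass
class NodeStatus:
    node_id: str
    state: RunState

def find_abnormal_nodes(statuses: list[NodeStatus]) -> list[NodeStatus]:
    """S201 sketch: keep the nodes whose reported running state is abnormal."""
    return [s for s in statuses if s.state in (RunState.BLOCKED, RunState.DOWN)]
```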
S202, after determining that at least one application node monitored by the first monitoring tool is abnormal, recovering the at least one abnormal application node to obtain a processing result.
In the embodiment of the application, an application node being abnormal may mean that the application node is in a downtime state and/or a blocking state, so recovery processing is performed on the at least one abnormal application node. If an application node is in the downtime state, recovery cannot be performed on it, but recovery can be performed on an application node in the blocking state.
In this step, after it is determined that at least one application node is abnormal, recovery processing may be performed on the abnormal application nodes to obtain a processing result, where the processing result includes the application nodes that have not been restored to the running state; these may be blocking nodes and/or downtime nodes, and it is further determined whether each such application node is a blocking node or a downtime node.
The method for restoring the application node according to the embodiment of the present application is not particularly limited, and may refer to the existing method, or may define a new method for restoring the application node.
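Because the recovery method is deliberately left open, the sketch below only illustrates the general shape of step S202; try_recover is a hypothetical placeholder for whatever recovery routine a deployment actually uses.

```python
from typing import Callable

def recover_nodes(abnormal: list, try_recover: Callable[[object], bool]) -> dict:
    """S202 sketch: attempt recovery on every abnormal node and record the outcome.

    Nodes in the downtime state will typically fail to recover here, so they end
    up in the "still_abnormal" part of the processing result.
    """
    recovered, still_abnormal = [], []
    for node in abnormal:
        (recovered if try_recover(node) else still_abnormal).append(node)
    return {"recovered": recovered, "still_abnormal": still_abnormal}
```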
S203, judging whether a blocking node exists in the at least one application node based on the processing result.
In the embodiment of the application, because the downtime node and/or the blocking node exist in the processing result, the processing result needs to be judged to determine the blocking node.
Optionally, a second monitoring tool may be introduced to determine whether a blocking node exists in the processing result, where the second monitoring tool is configured to monitor whether an application node is down, determine, based on the second monitoring tool monitoring the application node that is down, the application node that is down in the processing result, and further determine the blocking node of the processing result.
Alternatively, a predefined identification method may be used to identify the blocking node in the processing result, and the embodiment of the present application does not specifically limit the predefined identification method.
It should be noted that, the method for determining whether there is a blocking node in at least one application node where an abnormality occurs in the embodiment of the present application is not limited in particular, and the foregoing is merely illustrative, and the determination may be performed based on other methods, such as a recognition model, etc.
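Assuming the second monitoring tool's downtime list is available, a minimal sketch of the blocking-node judgement of step S203 might look as follows; the function and parameter names are illustrative, not taken from the patent.

```python
def find_blocking_nodes(still_abnormal_ids: set, down_ids: set) -> set:
    """S203 sketch: abnormal nodes that the second monitoring tool does not
    report as down are treated as blocking nodes."""
    return still_abnormal_ids - down_ids
```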
S204, if yes, acquiring an abnormal log corresponding to the blocking node for data analysis, and generating an analysis result so as to remind a user to correct the blocking node based on the analysis result; the anomaly log includes monitoring indicator data for at least one type of monitoring tool.
In the embodiment of the application, at least one type of monitoring tool may refer to monitoring tools that implement different monitoring functions, and each type of monitoring tool corresponds to its own monitoring index data; for example, a Zabbix-type monitoring tool corresponds to monitoring index data of host and application conditions; an APM monitoring tool based on network traffic bypass technology corresponds to monitoring index data of application performance conditions such as response rate, success rate, response time and transaction amount; and the ELK log analysis platform corresponds to monitoring index data of the system log.
In this step, the monitoring application system may include a log analysis platform; further, based on the ELK log analysis platform, the system log (abnormal log) corresponding to the blocking node can be analyzed, that is, by integrating the monitoring index data of at least one type of monitoring tool, automatic analysis and fault processing are implemented, so as to determine the application node that has a fault and remind the user to overhaul that application node.
Optionally, after the abnormal log corresponding to the blocking node is obtained, each corresponding performance index in the system can be automatically displayed in a graphical manner so as to realize monitoring and alarming; wherein, for the faulty application node, the display may be highlighted, for example, in the form of color labeling, display frame circling, etc., which is not particularly limited in the embodiment of the present application.
Therefore, the embodiment of the application provides a monitoring data processing method, which can automatically realize analysis and processing in a system by integrating monitoring index data of at least one type of monitoring tool corresponding to a blocking node, so that the stability and the robustness of the system are greatly ensured while the analysis flow is simplified.
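Tying steps S201-S204 together, the following sketch reuses the helpers from the previous sketches; fetch_exception_log, analyse and notify_user are hypothetical placeholders for the log retrieval, analysis and reminder mechanisms described above.

```python
def process_monitoring_data(statuses, try_recover, down_ids,
                            fetch_exception_log, analyse, notify_user):
    """End-to-end sketch of S201-S204, reusing the helpers sketched above."""
    abnormal = find_abnormal_nodes(statuses)                 # S201
    if not abnormal:
        return
    result = recover_nodes(abnormal, try_recover)            # S202
    still = {n.node_id for n in result["still_abnormal"]}
    for node_id in find_blocking_nodes(still, down_ids):     # S203
        log = fetch_exception_log(node_id)                   # indicator data of several tools
        notify_user(node_id, analyse(log))                   # S204: remind the user to correct
```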
Optionally, the method further comprises:
acquiring monitoring data of a second monitoring tool; the second monitoring tool is used for monitoring whether the application node is down;
When the fact that the application node monitored by the second monitoring tool is down is determined based on the monitoring data of the second monitoring tool, obtaining a down node;
judging whether the downtime node is consistent with at least one application node with abnormality;
If yes, a third monitoring tool is called, and the downtime node is restarted; the third monitoring tool is used for restarting the application node which is down.
In the embodiment of the application, the third monitoring tool can be used for collecting the monitoring indexes of the host/container resources, detecting the availability of the Internet application service, alarming and automatically executing the indexes, wherein the monitoring indexes comprise CPU use conditions, memory use conditions, server loads, loads of a database connection pool and the like; the third monitoring tool is further used for restarting the application node which is down, and all application nodes in an execution flow are not required to be restarted.
In this step, the second monitoring tool is configured to monitor the survival state of the application node, where the survival state includes a normal state and a downtime state, so that the survival state of the application node monitored by the second monitoring tool can be compared with the operation state of the application node monitored by the first monitoring tool to determine whether the downtime nodes monitored by the two monitoring tools are consistent; if yes, the automatic restarting flow on the third monitoring tool can be called, and the down node is restarted.
Therefore, the embodiment of the application can compare the monitoring data of the second monitoring tool with the monitoring data of the first monitoring tool to determine the downtime node, improve the accuracy of determining the downtime node, and simplify the restarting process by restarting the downtime node by utilizing the third monitoring tool.
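A minimal sketch of this optional downtime check, assuming plain sets of node identifiers; restart_node stands in for the automatic restart flow of the third monitoring tool, and all names are illustrative.

```python
def handle_downtime(abnormal_ids: set, down_ids: set, restart_node) -> set:
    """Restart only the nodes that both the first and second monitoring tools
    agree are down; return the inconsistent nodes for later handling."""
    confirmed_down = abnormal_ids & down_ids
    for node_id in confirmed_down:
        restart_node(node_id)        # automatic restart flow on the third monitoring tool
    return (abnormal_ids | down_ids) - confirmed_down   # handled in later sections
```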
Optionally, the method further comprises:
Searching position information of at least one application node with abnormality and the downtime node;
and graphically displaying the at least one abnormal application node and the downtime node based on the position information.
In the embodiment of the present application, the location information may refer to the position of the at least one abnormal application node and the downtime node in an execution flow; for example, the first monitoring tool monitors that a certain execution flow consists of application node 1 - application node 2 - application node 3, and when it is determined that application node 2 is abnormal, application node 2 may be highlighted when the whole execution flow is displayed.
Alternatively, the method of highlighting may be color labeling, underlining, adding a display frame, adding a special label, and the like, and the embodiment of the present application does not specifically limit the method of highlighting.
For example, after finding the position information corresponding to the at least one abnormal application node and the downtime node, the monitoring application system can visually display the application nodes included in the whole execution flow where the at least one abnormal application node is located, and mark the at least one abnormal application node in red; correspondingly, the application nodes included in the whole execution flow where the downtime node is located are visually displayed, and the downtime node is marked in yellow. If the downtime node is consistent with the at least one abnormal application node, the application node may be marked in orange.
It should be noted that, in the embodiment of the present application, the labeling forms of at least one application node and the downtime node, where the abnormality occurs, are not specifically limited, and may correspond to the same labeling form, may be different, and the foregoing is only illustrative.
Optionally, when the at least one abnormal application node and the downtime node are graphically displayed, a text box may be used to annotate a text description within a preset range around the at least one abnormal application node and the downtime node.
Therefore, the embodiment of the application highlights the application nodes corresponding to the displayed abnormal index data, which is intuitive and clear and makes it convenient for users to locate the problem.
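A possible sketch of the colour labelling described above for one execution flow; the colour values follow the red/yellow/orange example, which the text itself notes is only illustrative.

```python
def label_flow_nodes(flow_nodes: list, abnormal_ids: set, down_ids: set) -> dict:
    """Assign a highlight colour to each node of an execution flow."""
    labels = {}
    for node_id in flow_nodes:
        if node_id in down_ids and node_id in abnormal_ids:
            labels[node_id] = "orange"   # down node consistent with an abnormal node
        elif node_id in down_ids:
            labels[node_id] = "yellow"   # downtime node
        elif node_id in abnormal_ids:
            labels[node_id] = "red"      # abnormal application node
        else:
            labels[node_id] = "none"     # normal node, no highlighting
    return labels
```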
Optionally, the method further comprises:
And after determining that the downtime node is inconsistent with at least one abnormal application node, sending the at least one abnormal application node and the downtime node to a client for manual judgment so as to determine whether the at least one application node and the downtime node have normal application nodes.
In the embodiment of the application, when the downtime node is inconsistent with the at least one abnormal application node, the at least one abnormal application node may contain a blocking node, or the downtime node or the at least one abnormal application node may have been detected in error. Therefore, the at least one abnormal application node and the downtime node can be sent to the terminal equipment of the user, and the problem is located manually, for example by determining whether the at least one abnormal application node or the downtime node was judged in error and actually corresponds to a normal application node, or whether the at least one abnormal application node contains both a downtime node and a blocking node.
Optionally, after the problem is found manually, a result of manual analysis is obtained, and the result of manual analysis can be fed back to the monitoring application system to perform state correction or analysis processing of the application node.
Therefore, after the downtime node determined by the second monitoring tool is inconsistent with the at least one abnormal application node determined by the first monitoring tool, the at least one abnormal application node and the downtime node can be sent to the client for manual judgment, so that whether the monitoring tool is in error judgment or not can be determined, and the accuracy of monitoring judgment is improved.
Optionally, obtaining an exception log corresponding to the blocking node for data analysis, and generating an analysis result includes:
Acquiring a history blocking record corresponding to a blocking node in the abnormal log; the history blocking record comprises the recovery time and the blocking times of the blocking node;
And when the blocking times are greater than a first threshold value and/or the recovery time is greater than a second threshold value, generating an analysis result based on the abnormal log.
In the embodiment of the present application, the recovery time of a blocking node refers to the time from when it temporarily stops running until it returns to normal operation, and the blocking times refer to the number of times the application node has temporarily stopped running. The embodiment of the application does not specifically limit the recovery time and the blocking times of the blocking node, which can be determined based on the application scenario.
The first threshold may refer to a threshold set in advance for determining that the number of blocking times of the blocking node is excessive; the second threshold may refer to a threshold set in advance and used for determining that the recovery time of the blocking node is too long, and the specific numerical values corresponding to the first threshold and the second threshold are not limited in the embodiment of the present application.
In this step, the recovery time and the blocking times of all blocking nodes can be counted and analyzed based on the ELK log analysis platform. When the recovery time of a blocking node is greater than the second threshold, for example the blocking node has still not returned to normal operation within 10 minutes, the automatic restart flow on the third monitoring tool is invoked to restart the blocking node; and/or, when the blocking times of a blocking node are greater than the first threshold, for example recovery processing has been performed on the blocking node more than 5 times, the automatic restart flow on the third monitoring tool is invoked to restart the blocking node.
And when the blocking times of the blocking node are greater than a first threshold value and/or the recovery time is greater than a second threshold value, generating analysis results based on system logs corresponding to the blocking node on other types of monitoring tools so as to determine whether other monitoring index data are abnormal.
Optionally, whether other monitoring index data are abnormal or not is judged, and whether the monitoring index data are abnormal or not can be determined by setting a monitoring index threshold, for example, whether the response rate of interface call, the success rate of interface call, the response time of interface call and the like exceed the corresponding monitoring index threshold or not is judged, so that whether the abnormality occurs or not is determined.
Therefore, after the blocking times and/or the recovery time of the blocking node are determined to exceed the corresponding threshold values, the embodiment of the application automatically acquires the monitoring index data of the related monitoring tools of various types of the system to generate the analysis result, searches the problem, improves the accuracy of data analysis and can accurately position the problem.
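A sketch of this threshold check; the default values (5 recoveries, 10 minutes) mirror the examples above and are otherwise left open by the text, and build_report is a hypothetical placeholder for the analysis over the multi-tool exception log.

```python
def analyse_blocking_record(record: dict, exception_log: dict, build_report,
                            block_count_threshold: int = 5,
                            recovery_time_threshold_s: int = 10 * 60):
    """Generate an analysis result when the blocking count and/or recovery time
    exceed their thresholds; otherwise keep waiting on normal recovery."""
    if (record["block_count"] > block_count_threshold
            or record["recovery_time_s"] > recovery_time_threshold_s):
        return build_report(exception_log)   # inspect the other tools' indicator data
    return None
```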
Optionally, the method further comprises:
And after the analysis result is generated, a third monitoring tool is called, and the blocking node is restarted.
In the step, after an analysis result is generated, whether the blocking node needs to be restarted or not can be automatically judged based on the influence degree of the blocking node; if the blocking times of the blocking node are greater than a first threshold value and/or the recovery time of the blocking node is greater than a second threshold value, automatically calling an automatic restarting flow on a third monitoring tool to restart the blocking node; otherwise, the recovery process for the blocking node may continue.
It should be noted that, when the blocking number of times of the blocking node is greater than the first threshold value and/or the recovery time of the blocking node is greater than the second threshold value, the blocking node may be regarded as a downtime node.
Optionally, whether the blocking node needs to be restarted can also be determined based on the importance degree of the blocking node, such as priority information, and the restart condition of the blocking node is not specifically limited in the embodiment of the present application.
Therefore, the embodiment of the application can automatically judge whether the blocking node needs to be restarted based on the influence degree of the blocking node, and can select to restart the blocking node when the influence degree of the blocking node is higher, thereby improving the flexibility of restarting the blocking node.
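A sketch of the restart decision after an analysis result exists, assuming the same thresholds plus an optional priority list; the priority criterion is only one of the possibilities mentioned above, and the names are illustrative.

```python
def should_restart(record: dict, high_priority_nodes: frozenset = frozenset(),
                   block_count_threshold: int = 5,
                   recovery_time_threshold_s: int = 10 * 60) -> bool:
    """Restart when the thresholds are exceeded or the blocking node is marked
    as high priority; otherwise continue recovery processing."""
    exceeded = (record["block_count"] > block_count_threshold
                or record["recovery_time_s"] > recovery_time_threshold_s)
    return exceeded or record["node_id"] in high_priority_nodes
```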
Optionally, the method further comprises:
graphically displaying the monitoring index data of at least one type of monitoring tool; the monitoring index data includes: interface call success rate, interface call response time, the execution flow of each application node, memory occupation, and central processing unit (CPU) occupation.
In the embodiment of the application, the interface call success rate = the number of successful interface calls / the total number of interface calls; the interface call response rate = the number of interface call responses / the total number of interface calls. The number of interface call responses includes both the successful interface calls and the failed interface calls that still return response information, for example clicking an interface module that does not jump to the corresponding module but returns a '404' error response.
The response time of an interface call may refer to the time from when an interface is called until response information is given; the interface call success rate, response rate and response time are not particularly limited in the embodiment of the application.
Optionally, when the memory occupation and/or the CPU occupation is greater than the corresponding preset threshold, it may also be determined that an abnormality occurs, so as to generate alarm information; the setting size of the preset threshold is not particularly limited in the embodiment of the application, and the setting size can be set based on application scenes.
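A sketch of how the interface-call indicators and the memory/CPU alarm check defined above might be computed; the 0.9 resource thresholds are assumptions, since the text leaves the preset thresholds open.

```python
def interface_call_metrics(success_count: int, response_count: int,
                           total_count: int) -> dict:
    """Compute the interface-call indicators defined above; response_count also
    counts failed calls that still returned a response (e.g. a '404' error)."""
    if total_count == 0:
        return {"success_rate": 0.0, "response_rate": 0.0}
    return {"success_rate": success_count / total_count,
            "response_rate": response_count / total_count}

def resource_alarm(memory_usage: float, cpu_usage: float,
                   memory_threshold: float = 0.9, cpu_threshold: float = 0.9) -> bool:
    """Raise an alarm when memory and/or CPU occupation exceed their preset
    thresholds (the 0.9 defaults are assumptions)."""
    return memory_usage > memory_threshold or cpu_usage > cpu_threshold
```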
In this step, a unified monitoring display model interface may be established, and monitoring index data of at least one type of monitoring tool may be automatically acquired for graphical display, and accordingly, after determining that an application node or monitoring index data is abnormal based on the monitoring index data, the abnormal application node or monitoring index data may also be highlighted. It should be noted that, the method of graphical display may refer to the description of the above embodiments, and will not be repeated here.
For example, fig. 3 is a schematic flow chart of graphical display of monitoring index data provided by the embodiment of the present application, and as shown in fig. 3, when an alarm of a monitoring application system is generated based on P5 traffic, monitoring index data of multiple types of monitoring tools may be automatically summarized for visual display.
The interface call success rate, response rate and response time over approximately the last 5 minutes monitored by the monitoring tool 1 are visually displayed; the survival states of all application nodes of the system monitored by the monitoring tool 2 are visually displayed; the running states of all application nodes of the system monitored by the monitoring tool 3 are visually displayed; indexes such as CPU and memory of all machines of the system monitored by the monitoring tool 4 are visually displayed; and the call chain monitored by the monitoring tool 5 is visually displayed, the call chain being the execution flow of each application node.
Furthermore, the data analysis can be performed based on the monitoring index data of the monitoring tools of the plurality of types, so as to find out the generated problems.
Therefore, the embodiment of the application can automatically make a graphical display of various performance indexes of the system, is used for carrying out integral analysis on the system, simplifies the analysis flow and improves the monitoring convenience.
In connection with the above embodiment, fig. 4 is a schematic flow chart of an alternative monitoring data processing method according to an embodiment of the present application, as shown in fig. 4, the monitoring data processing method includes the following steps:
step A: the running states of all application nodes on the monitoring tool 3 are displayed, the survival states of all application nodes on the monitoring tool 2 are displayed, and then the system judges whether downtime nodes exist or not based on the running states of the application nodes: if yes, executing the step B; if not, executing the step C.
Step B: judging whether the downtime node determined by the monitoring tool 3 is consistent with the downtime node determined by the monitoring tool 2; if so, executing step D; if the abnormal application nodes monitored by the monitoring tool 3 are inconsistent with the downtime nodes monitored by the monitoring tool 2, the abnormal application nodes and the downtime nodes are sent for manual judgment.
Step C: determining that abnormal nodes exist, and further performing node recovery to determine whether blocking nodes exist; after determining that blocking nodes exist, analyzing and displaying the blocking nodes, invoking the log analysis platform to perform statistical analysis on the abnormal logs, and executing step D after obtaining an analysis result.
Step D: and calling an automatic restarting process of the monitoring tool 4 to restart the downtime node.
Therefore, the displayed abnormal index data can be analyzed, statistics can be realized on the error log, and whether the application node needs to be restarted or not can be automatically judged according to the analyzed result.
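A loose sketch of the step A-D flow of FIG. 4; the tool numbering follows that figure, and all callback parameters are hypothetical placeholders.

```python
def run_monitoring_cycle(tool3_abnormal: set, tool2_down: set,
                         restart_node, send_to_manual, analyse_blocking):
    """Approximate the step A-D decision flow over sets of node identifiers."""
    if tool2_down:                                        # step A: downtime nodes exist
        consistent = tool3_abnormal & tool2_down          # step B: do the tools agree?
        for node_id in consistent:
            restart_node(node_id)                         # step D: restart via monitoring tool 4
        inconsistent = (tool3_abnormal | tool2_down) - consistent
        if inconsistent:
            send_to_manual(inconsistent)                  # hand over for manual judgment
    else:                                                 # step C: no downtime nodes
        analyse_blocking(tool3_abnormal)                  # recover, find blocking nodes, analyse
```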
In the foregoing embodiment, the method for processing monitoring data provided in the embodiment of the present application is described, and in order to implement each function in the method provided in the embodiment of the present application, an electronic device as an execution body may include a hardware structure and/or a software module, and each function may be implemented in the form of a hardware structure, a software module, or a hardware structure and a software module. Some of the functions described above are performed in a hardware configuration, a software module, or a combination of hardware and software modules, depending on the specific application of the solution and design constraints.
For example, fig. 5 is a schematic structural diagram of a monitoring data processing apparatus according to an embodiment of the present application, and as shown in fig. 5, the apparatus includes: a receiving module 501, a recovering module 502, a judging module 503 and an analyzing module 504; the receiving module 501 is configured to receive monitoring data sent by a first monitoring tool, and determine, based on the monitoring data, whether an abnormality occurs in an application node monitored by the first monitoring tool; the first monitoring tool is used for monitoring the running state of the application node;
the recovery module 502 is configured to, after determining that an abnormality occurs in at least one application node monitored by the first monitoring tool, perform recovery processing on the at least one application node in which the abnormality occurs, to obtain a processing result;
The judging module 503 is configured to judge whether a blocking node exists in the at least one application node based on the processing result;
The analysis module 504 is configured to, after determining that a blocking node exists in the at least one application node, obtain an exception log corresponding to the blocking node to perform data analysis, and generate an analysis result, so as to remind a user to correct the blocking node based on the analysis result; the anomaly log includes monitoring indicator data for at least one type of monitoring tool.
Optionally, the device further includes a comparison module, where the comparison module is configured to:
acquiring monitoring data of a second monitoring tool; the second monitoring tool is used for monitoring whether the application node is down;
When the fact that the application node monitored by the second monitoring tool is down is determined based on the monitoring data of the second monitoring tool, obtaining a down node;
judging whether the downtime node is consistent with at least one application node with abnormality;
If yes, a third monitoring tool is called, and the downtime node is restarted; the third monitoring tool is used for restarting the application node which is down.
Optionally, the device further includes a first display module, where the first display module is configured to:
Searching position information of at least one application node with abnormality and the downtime node;
and graphically displaying the at least one abnormal application node and the downtime node based on the position information.
Optionally, the device further comprises a sending module; the sending module is used for:
And after determining that the downtime node is inconsistent with at least one abnormal application node, sending the at least one abnormal application node and the downtime node to a client for manual judgment so as to determine whether the at least one application node and the downtime node have normal application nodes.
Optionally, the analysis module 504 is specifically configured to:
Acquiring a history blocking record corresponding to a blocking node in the abnormal log; the history blocking record comprises the recovery time and the blocking times of the blocking node;
And when the blocking times are greater than a first threshold value and/or the recovery time is greater than a second threshold value, generating an analysis result based on the abnormal log.
Optionally, the apparatus further includes a restart module, where the restart module is configured to:
And after the analysis result is generated, a third monitoring tool is called, and the blocking node is restarted.
Optionally, the device further includes a second display module, where the second display module is configured to:
graphically displaying the monitoring index data of at least one type of monitoring tool; the monitoring index data includes: interface call success rate, interface call response time, the execution flow of each application node, memory occupation, and central processing unit (CPU) occupation.
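As a structural illustration only, the following skeleton mirrors the four core modules of the apparatus in FIG. 5; the class and method names are not from the patent, and the bodies are placeholders.

```python
class MonitoringDataProcessingApparatus:
    """Skeleton mirroring the modules of FIG. 5; bodies are placeholders."""

    def receive(self, monitoring_data):       # receiving module 501
        raise NotImplementedError

    def recover(self, abnormal_nodes):        # recovery module 502
        raise NotImplementedError

    def judge(self, processing_result):       # judging module 503
        raise NotImplementedError

    def analyse(self, blocking_node):         # analysis module 504
        raise NotImplementedError
```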
The specific implementation principle and effect of the monitoring data processing device provided by the embodiment of the present application can be referred to the related description and effect corresponding to the above embodiment, and will not be repeated here.
Exemplary, the embodiment of the present application further provides a schematic structural diagram of an electronic device, and fig. 6 is a schematic structural diagram of an electronic device provided by the embodiment of the present application, as shown in fig. 6, where the electronic device may include: a processor 601 and a memory 602 communicatively coupled to the processor; the memory 602 stores a computer program; the processor 601 executes the computer program stored in the memory 602, causing the processor 601 to perform the method as described in any one of the embodiments above.
Wherein the memory 602 and the processor 601 may be connected by a bus 603.
Embodiments of the present application also provide a computer-readable storage medium storing computer program-executable instructions that, when executed by a processor, are configured to implement a method as described in any of the foregoing embodiments of the present application.
The embodiment of the application also provides a chip for running instructions, and the chip is used for executing the method in any of the previous embodiments executed by the electronic equipment in any of the previous embodiments.
Embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements the method described in any of the preceding embodiments performed by the electronic device.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules is merely a logical function division, and there may be additional divisions of actual implementation, e.g., multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to implement the solution of this embodiment.
In addition, each functional module in the embodiments of the present application may be integrated in one processing unit, or each module may exist alone physically, or two or more modules may be integrated in one unit. The units formed by the modules can be realized in a form of hardware or a form of hardware and software functional units.
The integrated modules, which are implemented in the form of software functional modules, may be stored in a computer readable storage medium. The software functional modules described above are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or processor to perform some of the steps of the methods described in the various embodiments of the application.
It should be appreciated that the processor may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the present application may be directly embodied as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
The memory may include a high-speed random access memory (RAM) and may further include a non-volatile memory (NVM), such as at least one magnetic disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, or an optical disk.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the buses in the drawings of the present application are not limited to only one bus or one type of bus.
The storage medium may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as a static random-access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disk. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). Alternatively, the processor and the storage medium may reside as discrete components in an electronic device or a master device.
The foregoing is merely a specific implementation of the embodiment of the present application, but the protection scope of the embodiment of the present application is not limited to this, and any changes or substitutions within the technical scope disclosed in the embodiment of the present application should be covered in the protection scope of the embodiment of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of monitoring data processing, the method comprising:
receiving monitoring data sent by a first monitoring tool, and judging whether an application node monitored by the first monitoring tool is abnormal or not based on the monitoring data; the first monitoring tool is used for monitoring the running state of the application node;
when it is determined that at least one application node monitored by the first monitoring tool is abnormal, recovering the at least one abnormal application node to obtain a processing result;
judging whether a blocking node exists in the at least one application node or not based on the processing result;
if yes, acquiring an abnormal log corresponding to the blocking node for data analysis, and generating an analysis result, so as to remind a user to correct the blocking node based on the analysis result; the abnormal log includes monitoring index data of at least one type of monitoring tool.
2. The method according to claim 1, wherein the method further comprises:
acquiring monitoring data of a second monitoring tool; the second monitoring tool is used for monitoring whether the application node is down;
when it is determined, based on the monitoring data of the second monitoring tool, that the application node monitored by the second monitoring tool is down, obtaining a downtime node;
judging whether the downtime node is consistent with the at least one abnormal application node;
if yes, invoking a third monitoring tool to restart the downtime node; the third monitoring tool is used for restarting the application node that is down.
3. The method according to claim 2, wherein the method further comprises:
searching for position information of the at least one abnormal application node and the downtime node;
graphically displaying the at least one abnormal application node and the downtime node based on the position information.
4. The method according to claim 2, wherein the method further comprises:
after determining that the downtime node is inconsistent with the at least one abnormal application node, sending the at least one abnormal application node and the downtime node to a client for manual judgment, so as to determine whether a normal application node exists among the at least one application node and the downtime node.
5. The method of claim 1, wherein obtaining the exception log corresponding to the blocking node for data analysis, generating an analysis result, comprises:
acquiring a history blocking record corresponding to the blocking node in the abnormal log; the history blocking record includes the recovery time and the blocking times of the blocking node;
when the blocking times are greater than a first threshold and/or the recovery time is greater than a second threshold, generating an analysis result based on the abnormal log.
6. The method according to claim 1, wherein the method further comprises:
after the analysis result is generated, invoking a third monitoring tool to restart the blocking node.
7. The method according to any one of claims 1-6, further comprising:
graphically displaying the monitoring index data of at least one type of monitoring tool; the monitoring index data includes: an interface call success rate, an interface call response time, an execution flow of each application node, memory occupation, and central processing unit (CPU) occupation.
8. A monitoring data processing apparatus, the apparatus comprising:
a receiving module, used for receiving monitoring data sent by a first monitoring tool, and judging, based on the monitoring data, whether an application node monitored by the first monitoring tool is abnormal; the first monitoring tool is used for monitoring the running state of the application node;
a recovery module, used for performing recovery processing on at least one abnormal application node after determining that the at least one application node monitored by the first monitoring tool is abnormal, to obtain a processing result;
a judging module, used for judging, based on the processing result, whether a blocking node exists in the at least one application node;
an analysis module, used for acquiring an abnormal log corresponding to the blocking node for data analysis after determining that the blocking node exists in the at least one application node, and generating an analysis result, so as to remind a user to correct the blocking node based on the analysis result; the abnormal log includes monitoring index data of at least one type of monitoring tool.
9. An electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes the computer-executable instructions stored in the memory to implement the method according to any one of claims 1-7.
10. A computer readable storage medium storing computer executable instructions which when executed by a processor are adapted to carry out the method of any one of claims 1 to 7.
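For illustration only, and without limiting the scope of the claims, the flow recited in claims 1 and 5 could be sketched roughly as follows in Python; the first_tool.recover and log_store.history_blocking_record calls, the threshold values, and the node-state strings are hypothetical assumptions of this sketch rather than required implementations.

from typing import Dict, List, Optional

FIRST_THRESHOLD_BLOCK_COUNT = 3         # first threshold (blocking times), assumed value
SECOND_THRESHOLD_RECOVERY_SECONDS = 60  # second threshold (recovery time), assumed value


def process_monitoring_data(monitoring_data: Dict[str, str],
                            first_tool,
                            log_store) -> Optional[Dict[str, dict]]:
    # 1. judge, based on the monitoring data sent by the first monitoring tool,
    #    whether any monitored application node is abnormal
    abnormal_nodes: List[str] = [
        node for node, state in monitoring_data.items() if state != "RUNNING"
    ]
    if not abnormal_nodes:
        return None

    # 2. perform recovery processing on the abnormal nodes to obtain a processing result
    processing_result = {node: first_tool.recover(node) for node in abnormal_nodes}

    # 3. a node that still fails after recovery processing is treated as a blocking node
    blocking_nodes = [node for node, ok in processing_result.items() if not ok]
    if not blocking_nodes:
        return None

    # 4. analyse the abnormal log of each blocking node; generate an analysis result
    #    when the blocking times or the recovery time exceeds its threshold, so that
    #    the user can be reminded to correct the blocking node
    analysis_result: Dict[str, dict] = {}
    for node in blocking_nodes:
        record = log_store.history_blocking_record(node)  # hypothetical log query
        if (record["block_count"] > FIRST_THRESHOLD_BLOCK_COUNT
                or record["recovery_time_s"] > SECOND_THRESHOLD_RECOVERY_SECONDS):
            analysis_result[node] = record
    return analysis_result or None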