CN112699007B - Method, system, network device and storage medium for monitoring machine performance - Google Patents

Info

Publication number
CN112699007B
Authority
CN
China
Prior art keywords
alarm
abnormal
category
information
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110003255.3A
Other languages
Chinese (zh)
Other versions
CN112699007A (en)
Inventor
陈文娟
王昱丹
陈宇鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wangsu Science and Technology Co Ltd
Original Assignee
Wangsu Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wangsu Science and Technology Co Ltd
Priority to CN202110003255.3A
Publication of CN112699007A
Application granted
Publication of CN112699007B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452 Performance evaluation by statistical analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiments of the invention relate to the field of computer technology and disclose a method, a system, a network device and a storage medium for monitoring machine performance. The method comprises: acquiring the resource operation data of each of at least two monitored machines within a preset period, where the resource operation data comprises the alarms generated while each resource block in the machine is running and the alarm detection information corresponding to those alarms; acquiring label information of each alarm in the resource operation data, where the label information indicates the category of the alarm; acquiring, according to the label information of each alarm, the alarms that belong to an abnormal category from the resource operation data as abnormal alarms; determining analysis data for each abnormal category according to the abnormal alarms and their corresponding alarm detection information; and visualizing the acquired abnormal alarms and/or the analysis data according to a received display instruction. With this implementation, performance faults can be located in time and then eliminated quickly.

Description

Method, system, network device and storage medium for monitoring machine performance

Technical Field

Embodiments of the present invention relate to the field of computer technology, and in particular to a method, system, network device and storage medium for monitoring machine performance.

Background Art

Enterprises operate very large numbers of machines. If a performance problem cannot be located and resolved in time, it persists and affects the operation of the machine, so monitoring machine performance is necessary. Machine performance problems include software-level problems and hardware problems. At the software level, CPU, memory, disk I/O and load are important indicators that affect machine performance. Current software performance monitoring therefore usually collects software-level indicator data and analyzes machine performance based on the collected data.

However, current machine performance analysis is very simple, which does not help to locate and resolve problems in time. For example, machine performance monitoring typically extracts and outputs information such as "CPU running high" or "single CPU core running high", but it does not analyze the collected indicator data in detail. As a result, staff or electronic devices cannot eliminate the performance problem in time from the output information.

Summary of the Invention

The purpose of the embodiments of the present invention is to provide a method, system, network device and storage medium for monitoring machine performance, so that performance faults can be located in time and eliminated quickly, reducing how long a faulty network device has an impact.

To solve the above technical problem, an embodiment of the present invention provides a method for monitoring machine performance, comprising: obtaining the resource operation data of each of at least two monitored machines within a preset period, where the resource operation data includes the alarms generated while each resource block in the machine is running and the alarm detection information corresponding to each alarm, and the alarm detection information includes any combination of the following: the duration of the alarm, the number of times the alarm persisted, and the traffic when the alarm was triggered; obtaining the label information of each alarm in the resource operation data, where the label information indicates the category to which the alarm belongs and is produced by each machine labeling the alarm with its category; obtaining, according to the label information of each alarm, the alarms that belong to an abnormal category from the resource operation data as abnormal alarms; determining the analysis data of each abnormal category according to the abnormal alarms and their corresponding alarm detection information, where the analysis data includes information about the machines on which the abnormal alarms of that category occurred; and visualizing the obtained abnormal alarms and/or the analysis data according to a received display instruction.

An embodiment of the present invention also provides a method for monitoring machine performance that runs on the monitored machine, comprising: obtaining the resource data generated while each resource block in the machine is running, where the resource data includes alarms; according to the analysis strategy corresponding to each resource block, obtaining the alarm detection information of the alarms in that resource block and adding label information to each alarm in the resource data, where the label information indicates the category to which the alarm belongs; and determining the resource operation data of the machine from the alarm detection information and the resource data, so that the monitoring platform can execute the method for monitoring machine performance described above.
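The machine-side method can be pictured with a small Python sketch. Everything concrete here is an assumption made for illustration: the threshold-based "analysis strategies", the function and field names, the limits of 90 and 95, and the idea of one utilization sample per minute are not specified in the source. Only the label values `CPU` and `MEMORY` are taken from the examples given later in the description.

```python
def make_threshold_strategy(label, limit):
    """Build a hypothetical analysis strategy for one resource block.

    The strategy raises an alarm when any sample exceeds `limit`, and
    derives the alarm detection information from the offending samples.
    """
    def strategy(samples):
        high = [s for s in samples if s > limit]
        if not high:
            return None  # no alarm for this resource block
        return {
            "label": label,             # label information (alarm category)
            "occurrences": len(high),   # how many samples ran high
            "duration_min": len(high),  # assumes one sample per minute
        }
    return strategy

# One strategy per resource block (assumed limits).
STRATEGIES = {
    "cpu": make_threshold_strategy("CPU", 90),
    "memory": make_threshold_strategy("MEMORY", 95),
}

def build_resource_operation_data(machine_ip, resource_samples):
    """Apply each block's strategy and collect labeled alarm records."""
    records = []
    for block, samples in resource_samples.items():
        strategy = STRATEGIES.get(block)
        if strategy is None:
            continue  # no strategy configured for this block
        alarm = strategy(samples)
        if alarm is not None:
            records.append({"machine": machine_ip, "resource_block": block, **alarm})
    return records

data = build_resource_operation_data(
    "10.0.0.1",
    {"cpu": [50, 95, 97, 60], "memory": [40, 42, 41, 43]},
)
```

In this sketch the labeling happens on the machine, so the platform receiving `data` never has to classify an alarm itself.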

An embodiment of the present invention further provides a system for monitoring machine performance, comprising: a monitoring platform that executes the platform-side method described above, and machines that execute the machine-side method for monitoring machine performance described above.

In the method for monitoring machine performance of the embodiments of the present application, the resource operation data to be analyzed is acquired over a preset period, so the data reflects the alarms raised by the monitored machines while running during that period, together with the alarm detection information corresponding to each alarm. The same abnormal problem can be observed recurring within the period, which makes it easier to locate that problem accurately afterwards. The alarm detection information includes any combination of the following: the duration of the alarm, the number of times the alarm persisted, and the traffic when the alarm was triggered. This exposes alarms that have stayed in a poor state for a long time, so the alarm detection information reflects the impact of an alarm on the machine and supports later statistics and analysis. Because the resource operation data contains not only the alarms but also their alarm detection information, the monitoring platform can perform fault analysis more easily. The analysis data is determined from the abnormal alarms and their alarm detection information rather than from the alarms alone, and displaying the analysis data together with the abnormal alarms helps staff locate the abnormal alarms. When a large number of machines is monitored, the resource operation data of multiple machines is involved. Each alarm has label information that indicates its category, so alarms of the same abnormal category can be filtered out by label, and the monitoring platform can collect the abnormal alarms of each abnormal category across multiple monitored machines. Because the abnormal alarms do not come from a single machine, staff can use the information about the machines on which the abnormal alarms occurred to locate the cause of the abnormality accurately. The category of each abnormal alarm is also available directly from its label information, so no reclassification is needed; this reduces the time spent classifying abnormal alarms, makes it easier to locate the abnormal position promptly by category, and shortens the time an abnormal alarm persists.

In addition, obtaining the alarms that belong to an abnormal category from the resource operation data as abnormal alarms, according to the label information of each alarm, includes: if the label information indicates that the alarm belongs to the traffic alarm category, taking as an abnormal alarm an alarm whose cumulative number of connections per minute exceeds the connection threshold, or an alarm whose inflow traffic per minute exceeds the inflow traffic threshold, or an alarm whose outflow traffic per minute exceeds the outflow traffic threshold; if the label information indicates that the alarm belongs to the hardware alarm category, taking the alarm corresponding to that label information as an abnormal alarm; if the label information indicates that the alarm belongs to the processor running-high category, taking as an abnormal alarm an alarm whose number of running-high occurrences in the current period exceeds the running-high threshold; and if the label information indicates that the alarm belongs to the process running-high category, taking as an abnormal alarm an alarm whose number of process running-high occurrences in the current period exceeds the process running-high threshold. Handling the alarms of each label separately makes it possible to filter out, quickly and in a targeted way, the data that is useful for alarm analysis, to discard alarms that have little to do with abnormal alarms, and to reduce the amount of data for later analysis.
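The per-category rules can be written as one dispatch function. The following Python sketch assumes a dictionary representation of an alarm and hypothetical category names, field names and threshold values; only the rules themselves (three traffic thresholds joined by "or", hardware alarms always kept, count thresholds for processor and process running-high alarms) come from the text.

```python
def is_abnormal(alarm, thresholds):
    """Decide whether a labeled alarm counts as an abnormal alarm."""
    category = alarm["category"]
    if category == "traffic":
        # Any one of the three per-minute limits being exceeded is enough.
        return (alarm["connections_per_min"] > thresholds["connections"]
                or alarm["inflow_per_min"] > thresholds["inflow"]
                or alarm["outflow_per_min"] > thresholds["outflow"])
    if category == "hardware":
        # Hardware alarms are kept unconditionally.
        return True
    if category == "cpu_high":
        return alarm["high_count"] > thresholds["cpu_high"]
    if category == "process_high":
        return alarm["high_count"] > thresholds["process_high"]
    # Alarms of any other category are dropped from the analysis.
    return False

# Assumed threshold values, for illustration only.
THRESHOLDS = {"connections": 3000, "inflow": 500, "outflow": 500,
              "cpu_high": 5, "process_high": 5}

alarms = [
    {"category": "traffic", "connections_per_min": 100,
     "inflow_per_min": 800, "outflow_per_min": 10},
    {"category": "hardware"},
    {"category": "cpu_high", "high_count": 2},
    {"category": "process_high", "high_count": 9},
    {"category": "info"},
]
abnormal = [a for a in alarms if is_abnormal(a, THRESHOLDS)]
```

Keeping the rules in a single function makes it straightforward to add a category without touching the code that collects or analyzes the alarms.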

In addition, determining the analysis data related to each abnormal category according to the abnormal alarms and their corresponding alarm detection information includes: according to each piece of alarm detection information and the abnormal alarms, collecting statistics on the duration of the abnormal alarms in each abnormal category, the location of the machines on which the abnormal alarms occurred, and the distribution of the types of those machines; or, according to each piece of alarm detection information and the abnormal alarms, taking the processes that satisfy a preset analysis condition as key processes, and collecting statistics on the duration of the abnormal alarms in the key processes, the type of the machine on which each abnormal alarm occurred, and the location of that machine; or, according to each piece of alarm detection information and the abnormal alarms, counting the number of machines with abnormal alarms in each specified time slot within the period, and the ratio of that number to the total number of monitored machines; or, according to each piece of alarm detection information and the abnormal alarms, collecting statistics on the machine types over which each abnormal category is distributed, the kernel versions of those machines, the product lines in which the abnormal category occurs, or the abnormal processes; or, according to each piece of alarm detection information and the abnormal alarms, obtaining from the collected abnormal alarms the information of the K worst-performing machines, where K is an integer greater than 1. Different analysis data is obtained from the alarm detection information and the abnormal alarms to provide input for locating the cause of a fault later and to meet different fault-location needs.
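Two of these statistics, the K worst-performing machines and the share of monitored machines affected, are easy to sketch in Python. The source does not say how "worst" is measured, so ranking machines by their total abnormal-alarm duration is an assumption, as are the record layout and function names.

```python
import heapq
from collections import Counter

def worst_k_machines(abnormal_alarms, k):
    """Rank machines by total abnormal-alarm duration (assumed criterion)."""
    totals = Counter()
    for alarm in abnormal_alarms:
        totals[alarm["machine"]] += alarm["duration_min"]
    # nlargest returns (machine, total) pairs, largest total first.
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])

def affected_ratio(abnormal_alarms, total_machines):
    """Share of monitored machines that raised at least one abnormal alarm."""
    affected = {alarm["machine"] for alarm in abnormal_alarms}
    return len(affected) / total_machines

abnormal_alarms = [
    {"machine": "a", "duration_min": 10},
    {"machine": "b", "duration_min": 40},
    {"machine": "a", "duration_min": 5},
    {"machine": "c", "duration_min": 1},
]
top2 = worst_k_machines(abnormal_alarms, 2)
ratio = affected_ratio(abnormal_alarms, 12)
```

Restricting `abnormal_alarms` to one time slot before calling `affected_ratio` would give the per-slot ratio mentioned in the text.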

In addition, the method further includes: obtaining the quality change data of the machine at the time the abnormal alarm was generated, where the quality change data indicates how the quality indicator data changed while the machine was running; associating the quality change data with the abnormal alarm; and visualizing the associated quality change data and abnormal alarm. Visualizing the associated quality change data and abnormal alarms gives a direct view of the abnormal alarms that affect quality the most, so that staff can focus on them.
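One way to associate the two kinds of data is to join them on machine and time window, as in the Python sketch below. The source does not define the quality indicator or the association rule, so the numeric quality samples, the minute-based alarm window, the "last minus first" definition of change and all field names are assumptions made for illustration.

```python
def associate(alarms, quality_samples):
    """Attach to each abnormal alarm the quality change seen during it.

    `quality_samples` is assumed to be sorted by minute. The change is the
    last sample value minus the first one inside the alarm window.
    """
    linked = []
    for alarm in alarms:
        window = [s["value"] for s in quality_samples
                  if s["machine"] == alarm["machine"]
                  and alarm["start"] <= s["minute"] <= alarm["end"]]
        change = window[-1] - window[0] if len(window) >= 2 else 0
        linked.append({**alarm, "quality_change": change})
    # Largest quality drop first, so the most harmful alarms lead the view.
    return sorted(linked, key=lambda record: record["quality_change"])

alarms = [
    {"machine": "m2", "start": 10, "end": 11},
    {"machine": "m1", "start": 10, "end": 12},
]
quality_samples = [
    {"machine": "m1", "minute": 10, "value": 100},
    {"machine": "m2", "minute": 10, "value": 100},
    {"machine": "m1", "minute": 11, "value": 90},
    {"machine": "m2", "minute": 11, "value": 98},
    {"machine": "m1", "minute": 12, "value": 70},
    {"machine": "m1", "minute": 13, "value": 95},
]
linked = associate(alarms, quality_samples)
```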

In addition, obtaining the resource operation data of each of at least two monitored machines within a preset period includes: obtaining the resource operation data of the specified machines within the preset period from a summary database, where the summary database receives the resource operation data reported by each monitored machine. Using a summary database to store the resource operation data reported by each monitored machine, instead of storing that data directly on the monitoring platform, reduces the amount of data the platform has to process later. Storing the resource operation data of every machine in the summary database also makes it easy to obtain the resource operation data of other machines from it.
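A summary database of this kind can be mimicked with Python's standard `sqlite3` module. The table layout, column names and helper functions below are hypothetical; the sketch only shows machines reporting rows and the platform pulling the rows of specified machines for one period.

```python
import sqlite3

# In-memory stand-in for the summary database (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resource_data ("
    "machine_ip TEXT, ts INTEGER, category TEXT, duration_min INTEGER)"
)

def report(machine_ip, ts, category, duration_min):
    """Called on behalf of a monitored machine to report one alarm record."""
    conn.execute(
        "INSERT INTO resource_data VALUES (?, ?, ?, ?)",
        (machine_ip, ts, category, duration_min),
    )

def fetch(machine_ips, start_ts, end_ts):
    """Platform side: rows of the specified machines within [start_ts, end_ts)."""
    placeholders = ",".join("?" for _ in machine_ips)
    sql = (
        "SELECT machine_ip, ts, category, duration_min FROM resource_data "
        f"WHERE machine_ip IN ({placeholders}) AND ts >= ? AND ts < ? "
        "ORDER BY ts"
    )
    return conn.execute(sql, (*machine_ips, start_ts, end_ts)).fetchall()

report("10.0.0.1", 100, "CPU", 12)
report("10.0.0.2", 150, "HW", 30)
report("10.0.0.1", 900, "CPU", 4)   # outside the queried period
report("10.0.0.3", 120, "IO", 7)    # not a specified machine
rows = fetch(["10.0.0.1", "10.0.0.2"], 0, 600)
```

Only the requested slice leaves the database, which is the sense in which the platform has less data to process.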

Brief Description of the Drawings

One or more embodiments are illustrated by the figures in the corresponding drawings. These illustrations do not limit the embodiments. Elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated, the figures are not drawn to scale.

FIG. 1 is a flow chart of a method for monitoring machine performance provided in a first embodiment of the present invention;

FIG. 2 is a flow chart of a method for monitoring machine performance provided in a second embodiment of the present invention;

FIG. 3 is a graph corresponding to Table 2 in a method for monitoring machine performance provided in a third embodiment of the present invention;

FIG. 4 is a flow chart of a method for monitoring machine performance provided in a third embodiment of the present invention;

FIG. 5 is a flow chart of a method for monitoring machine performance provided in a fourth embodiment of the present invention;

FIG. 6 is a flow chart of an implementation of obtaining an alarm factor provided in a fifth embodiment of the present invention;

FIG. 7 is a flow chart of an implementation of obtaining an alarm factor provided in a fifth embodiment of the present invention;

FIG. 8 is a flow chart of an implementation of obtaining an alarm factor provided in a fifth embodiment of the present invention;

FIG. 9 is a block diagram of a system for monitoring machine performance provided in a sixth embodiment of the present invention;

FIG. 10 is a structural block diagram of a network device provided in a seventh embodiment of the present invention.

Detailed Description

To make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the drawings. Those of ordinary skill in the art will understand that many technical details are given in the embodiments to help the reader understand the present application. The technical solutions claimed in the present application can nevertheless be implemented without these technical details and with various changes and modifications based on the following embodiments.

The following embodiments are divided for convenience of description only, and the division does not limit the specific implementation of the present invention. The embodiments may be combined with and refer to each other where they do not contradict.

The first embodiment of the present invention relates to a method for monitoring machine performance. The method is applied to a monitoring platform, and its flow is shown in FIG. 1:

Step 101: Obtain the resource operation data of each of at least two monitored machines within a preset period. The resource operation data includes the alarms generated while each resource block in the machine is running and the alarm detection information corresponding to each alarm. The alarm detection information includes any combination of the following: the duration of the alarm, the number of times the alarm persisted, and the traffic when the alarm was triggered.

Step 102: Obtain the label information of each alarm in the resource operation data. The label information indicates the category to which the alarm belongs and is produced by each machine labeling the alarm with its category.

Step 103: According to the label information of each alarm, obtain the alarms that belong to an abnormal category from the resource operation data as abnormal alarms.

Step 104: Determine the analysis data of each abnormal category according to the abnormal alarms and their corresponding alarm detection information. The analysis data includes information about the machines on which the abnormal alarms of that category occurred.

Step 105: Visualize the obtained abnormal alarms and/or the analysis data according to the received display instruction.
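Steps 101 to 105 can be read as a small pipeline. The Python sketch below is a simplification under stated assumptions: the `Alarm` record, the set of categories treated as abnormal, and the plain-text "visualization" are all hypothetical, and step 103 is reduced to a category lookup without any per-category thresholds.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Alarm:
    machine: str       # machine that raised the alarm, e.g. its IP address
    category: str      # label information attached by the machine
    duration_min: int  # alarm detection information: alarm duration
    occurrences: int   # alarm detection information: times it persisted

# Hypothetical set of categories treated as abnormal.
ABNORMAL_CATEGORIES = {"HW", "CPU", "TRAFFIC"}

def select_abnormal(alarms):
    """Steps 102-103: read each alarm's label and keep the abnormal ones."""
    return [a for a in alarms if a.category in ABNORMAL_CATEGORIES]

def analyse(abnormal):
    """Step 104: per-category analysis data, including the machines involved."""
    result = defaultdict(lambda: {"machines": set(), "total_duration_min": 0})
    for alarm in abnormal:
        entry = result[alarm.category]
        entry["machines"].add(alarm.machine)
        entry["total_duration_min"] += alarm.duration_min
    return dict(result)

def render(analysis):
    """Step 105: a plain-text stand-in for the visualization."""
    return [
        f"{category}: {len(data['machines'])} machine(s), "
        f"{data['total_duration_min']} min"
        for category, data in sorted(analysis.items())
    ]

# Step 101: resource operation data from at least two monitored machines.
alarms = [
    Alarm("10.0.0.1", "CPU", 12, 3),
    Alarm("10.0.0.2", "CPU", 5, 1),
    Alarm("10.0.0.2", "INFO", 1, 1),
    Alarm("10.0.0.3", "HW", 30, 1),
]
report = render(analyse(select_abnormal(alarms)))
```

Because the category travels with each alarm as a label, the platform-side code never has to infer it, which mirrors the "no reclassification" benefit claimed for the method.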

In the method for monitoring machine performance of the embodiments of the present application, the resource operation data to be analyzed is acquired over a preset period, so the data reflects the alarms raised by the monitored machines while running during that period, together with the alarm detection information corresponding to each alarm. The same abnormal problem can be observed recurring within the period, which makes it easier to locate that problem accurately afterwards. The alarm detection information includes any combination of the following: the duration of the alarm, the number of times the alarm persisted, and the traffic when the alarm was triggered. This exposes alarms that have stayed in a poor state for a long time, so the alarm detection information reflects the impact of an alarm on the machine and supports later statistics and analysis. Because the resource operation data contains not only the alarms but also their alarm detection information, the monitoring platform can perform fault analysis more easily. The analysis data is determined from the abnormal alarms and their alarm detection information rather than from the alarms alone, and displaying the analysis data together with the abnormal alarms helps staff locate the abnormal alarms. When a large number of machines is monitored, the resource operation data of multiple machines is involved. Each alarm has label information that indicates its category, so alarms of the same abnormal category can be filtered out by label, and the monitoring platform can collect the abnormal alarms of each abnormal category across multiple monitored machines. Because the abnormal alarms do not come from a single machine, staff can use the information about the machines on which the abnormal alarms occurred to locate the cause of the abnormality accurately. The category of each abnormal alarm is also available directly from its label information, so no reclassification is needed; this reduces the time spent classifying abnormal alarms, makes it easier to locate the abnormal position promptly by category, and shortens the time an abnormal alarm persists.

The second embodiment of the present invention relates to a method for monitoring machine performance. The second embodiment describes the first embodiment in more detail, and its flow is shown in FIG. 2:

Step 201: Obtain the resource operation data of each of at least two monitored machines within a preset period.

Specifically, the method for monitoring machine performance in this example is applied to a monitoring platform. The monitoring platform may be a single server or a combination of at least two servers, and it may be provided with a corresponding analysis database. The monitoring platform may communicate directly with the monitored machines: it receives the resource operation data uploaded by each monitored machine, may store that data in the analysis database, and obtains from the analysis database the resource operation data of the specified machines for the preset period. The monitoring platform may also receive the resource operation data for the preset period directly from the specified machines. The specified machines in this example may be determined from the received IP address information of the machines to be analyzed. The preset period may span one period, two periods or more. The resource operation data includes the alarms and the alarm detection information corresponding to each alarm. The alarm detection information includes any combination of the following: the duration of the alarm, the number of times the alarm persisted, and the traffic when the alarm was triggered. The alarm detection information is obtained by the monitored machine.

It should be noted that a monitored machine may be a server that runs the application used in this example to obtain resource operation data. The alarms may include CPU alarms, I/O alarms, memory alarms and load alarms of the current server, and may also carry information such as the time, the IP address of the server and the usage rate.

Step 202: Obtain the label information of each alarm in the resource operation data. The label information indicates the category to which the alarm belongs and is produced by each machine labeling the alarm with its category.

Specifically, each alarm in the resource operation data has corresponding label information, which is added by the monitored machine and indicates the category to which the alarm belongs. For example, in this example the HW field denotes the hardware alarm category; the label information may also be MEMORY, CPU, IO and so on. The label information HW_PARAM indicates the category of the specific abnormality detected in the memory of the corresponding hardware, such as LOW_CPU_FREQ, DISK_ERROR or NUMA_TOTAL_UNBALANCE. The SW label indicates the categories of software-related problems that have occurred, such as HIGH_CPU_PART or MEMORY_SUDDEN_CHANGE, and SW_PARAM is the category of the specific value of the corresponding detected SW problem. The SW_MORE label information indicates the detailed category of a software-related problem and has two meanings: it indicates the average per-minute traffic and number of connections within a period; and, if a CPU abnormality occurs and a running-high process is detected, it indicates the average per-minute traffic and number of connections within a period and additionally contains the information of the running-high process.

Step 203: According to the label information of each alarm, obtain the alarms that belong to an abnormal category from the resource operation data as abnormal alarms.

Specifically, because different label information has different meanings, the corresponding abnormal alarms can be obtained from the different labels. Several ways of obtaining abnormal alarms are described below.

In one example, if the label information indicates that the alarm belongs to the traffic alarm category, an alarm is taken as an abnormal alarm when its cumulative number of connections per minute exceeds the connection threshold, when its inbound traffic per minute exceeds the inbound traffic threshold, or when its outbound traffic per minute exceeds the outbound traffic threshold.

Specifically, a machine's performance data is usually correlated with high traffic on the machine, so this example sets an inbound traffic threshold, an outbound traffic threshold, and a connection threshold to filter out the abnormal alarms worth analyzing. For example, if the label information contains a field of the form TRAFFIC_PER_MIN_xK|yMB|zMB, where x is the cumulative number of connections per minute, y is the inbound traffic per minute, and z is the outbound traffic per minute, the values of x, y, and z are extracted. If x>3||y>500||z>500 holds, that is, the cumulative number of connections per minute exceeds 3K, or the inbound traffic per minute exceeds 500 MB, or the outbound traffic per minute exceeds 500 MB, the alarm is determined to be an abnormal alarm.
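The traffic filter above can be illustrated with a minimal Python sketch. The regular expression, the function names, and the constant names are assumptions made for illustration; only the field layout and the thresholds (3K connections, 500 MB inbound, 500 MB outbound) come from the text.

```python
import re

# Thresholds quoted in the text; the constant names are illustrative.
CONN_THRESHOLD_K = 3.0
IN_THRESHOLD_MB = 500.0
OUT_THRESHOLD_MB = 500.0

# Matches a field such as TRAFFIC_PER_MIN_20.44K|8682.6MB|13426.1MB.
_TRAFFIC_RE = re.compile(
    r"TRAFFIC_PER_MIN_(?P<x>\d+(?:\.\d+)?)K"
    r"\|(?P<y>\d+(?:\.\d+)?)MB"
    r"\|(?P<z>\d+(?:\.\d+)?)MB"
)


def parse_traffic_label(label):
    """Return (x, y, z) from the traffic field, or None if the field is absent."""
    match = _TRAFFIC_RE.search(label)
    if match is None:
        return None
    return float(match["x"]), float(match["y"]), float(match["z"])


def is_abnormal_traffic(label):
    """Apply the condition x>3 || y>500 || z>500 from the text."""
    parsed = parse_traffic_label(label)
    if parsed is None:
        return False
    x, y, z = parsed
    return x > CONN_THRESHOLD_K or y > IN_THRESHOLD_MB or z > OUT_THRESHOLD_MB
```

The comparisons are strict, so a label sitting exactly on a threshold is not treated as abnormal, which matches the ">" used in the condition.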

It is worth noting that if high-traffic abnormalities occur many times within the preset cycle, then for the monitoring system as a whole this shows that the problem keeps recurring within the cycle and has a large impact on the whole system, so it needs to be analyzed in order to reduce the number of alarms of the same category. This example therefore sets the inbound traffic threshold, outbound traffic threshold, and connection threshold to select high-traffic alarms as abnormal alarms, which improves the efficiency of subsequent analysis.

In another example, if the label information indicates that the alarm belongs to the hardware alarm category, the alarm corresponding to that label information is taken as an abnormal alarm. For example, if the label information contains the HW and HW_PARAM fields, the corresponding alarm is taken as an abnormal alarm, where the HW field denotes a hardware problem.

In another example, if the label information indicates that the alarm belongs to the processor running-high category, an alarm is taken as an abnormal alarm when its number of running-high occurrences in the current cycle exceeds the running-high threshold.

Specifically, in this example HIGH_CPU_PART in the SW_PARAM field corresponds to alarms that tend to run high frequently. When the total number of running-high occurrences is small, the impact on the monitored machine is small, so such alarms need to be filtered out. A running-high threshold can be set in advance; when the total number of running-high occurrences in HIGH_CPU_PART exceeds this threshold, the alarm is taken as an abnormal alarm. If any other SW_PARAM label is detected, the corresponding alarm is taken directly as an abnormal alarm.

In another example, if the label information indicates that the alarm belongs to the process running-high category, an alarm is taken as an abnormal alarm when the number of times the process ran high in the current cycle exceeds the process running-high threshold.

Processes recorded in the SW_MORE label also tend to trigger running-high alarms. Likewise, when the total number of running-high occurrences is small, the impact on the services of the monitored machine is small, so such alarms need to be filtered out. A process running-high threshold can be set in advance; when the total number of running-high occurrences of a running-high process exceeds this threshold, the corresponding alarm is taken as an abnormal alarm. In addition, this embodiment pays different levels of attention to different processes: for processes of greater concern the threshold can be set lower, and otherwise it can be set higher. For example, for the component-related processes shark, squid, wsxserver, and appa, the process running-high threshold is set to 30, and for other processes it is set to 60.
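The per-process thresholds in this example could be expressed as a small lookup, sketched below in Python. The function and constant names are hypothetical; the process names and the values 30 and 60 are those given in the text.

```python
# Processes of greater concern named in the text, with the lower threshold.
FOCUS_PROCESSES = {"shark", "squid", "wsxserver", "appa"}
FOCUS_THRESHOLD = 30
DEFAULT_THRESHOLD = 60


def process_high_threshold(process_name):
    """Return the running-high threshold that applies to a process."""
    if process_name in FOCUS_PROCESSES:
        return FOCUS_THRESHOLD
    return DEFAULT_THRESHOLD


def is_abnormal_process_alarm(process_name, high_count):
    """An alarm is abnormal when the count exceeds the process threshold."""
    return high_count > process_high_threshold(process_name)
```

Keeping the processes of concern in one set means a change in attention level only requires editing that set, not the filtering logic.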

Step 204: Store the abnormal alarms in the analysis database.

To facilitate visual analysis and monitoring of processes, this example stores process-related data in a single table at per-process granularity, as shown in Table 1:

Table 1

Step 205: Determine the analysis data of each abnormal category according to the abnormal alarms and their corresponding alarm detection information.

In one example, according to the alarm detection information and the abnormal alarms, the following are counted for each abnormal category: the duration of the abnormal alarms, the location information of the machines on which they occurred, and the type distribution of those machines.

After the abnormal alarms are obtained, data such as the preset indicators for each abnormal category and the number of machines hosting running-high processes are obtained, and abnormal alarms showing a sudden increase in the number of machines with a running-high process are identified from the alarm data of all machines, so that they can later be visualized or sent by email to the responsible personnel. By aggregating the data associated with the abnormal alarms in each abnormal category, the impact of that category's problems on server performance can be tracked in real time, which helps staff discover problems promptly. In particular, after a server component is upgraded, the analysis data makes it possible to find problems in time and reduce the impact of that category's faults on the service. For example, when monitoring a sudden increase in the number of IPs for a process, as shown in Table 2, the duration of the abnormal alarms in the abnormal category, the location information of the machines on which they occurred, and the type distribution of those machines can be obtained.

Table 2

As Table 2 shows, the number of machines on which this process raised abnormal alarms increased sharply compared with the previous cycle, and the type distribution of those machines has been compiled. The information in Table 2 helps staff locate the fault later.

Alternatively, in another example, according to the alarm detection information and the abnormal alarms, processes that satisfy a preset analysis condition are taken as key processes, and the following are counted for the key processes: the duration of the abnormal alarms, the type of machine on which each abnormal alarm occurred, and the location information of those machines.

Specifically, the analysis condition can be set in advance. For example, the preset analysis condition may be that the kernel running-high duration exceeds a preset duration threshold of 60 s. Processes that satisfy the condition are taken as key processes, and the duration of the abnormal alarms in the key processes, the type of machine on which each abnormal alarm occurred, and the location information of those machines are counted.

Monitoring key processes and their related information makes it easier to proactively investigate root causes and carry out optimization, thereby keeping the machines running normally. For example, the key-process statistics in this example yield the information shown in Table 3:

Table 3

The entries in the process column have been further processed into a compact format. For example, processes whose total memory running-high duration is at least 60 seconds are treated as key processes. The key process ksoftirqd_352 indicates that the process ran high for a total of 352 s in the last cycle (2020-09-05 18:31 to 2020-09-05 18:45). No further value follows ksoftirqd_352, which indicates that in the cycle before last (2020-09-05 18:16 to 2020-09-05 18:30) its total running-high duration was below 60 s, so it was not a key process in that cycle. In kswapd_588_125, the first value from the left indicates that the process ran high for 588 s in the last cycle, and the second value, 125, indicates an increase of 125 s compared with the preceding cycle. Counting the processes whose total running-high duration in the cycle exceeds the preset threshold as key processes makes it easy to remind staff to focus on them.
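To make the entry format concrete, the following Python sketch splits an entry such as kswapd_588_125 into its parts. It assumes that the trailing one or two underscore-separated integer fields are the duration and the increase, and that the increase is never negative; the function name and the returned tuple layout are hypothetical.

```python
def parse_key_process(entry):
    """Split 'name_duration[_increase]' into (name, duration_s, increase_s).

    increase_s is None when the entry carries only a duration, which the
    text reads as "not a key process in the cycle before".
    """
    parts = entry.split("_")
    numbers = []
    # Peel off at most two trailing integer fields; the rest is the name.
    while len(parts) > 1 and len(numbers) < 2 and parts[-1].isdigit():
        numbers.insert(0, int(parts.pop()))
    if not numbers:
        raise ValueError(f"no duration field in entry: {entry!r}")
    name = "_".join(parts)
    duration_s = numbers[0]
    increase_s = numbers[1] if len(numbers) == 2 else None
    return name, duration_s, increase_s
```

A process name that itself ends in an underscore followed by digits would be misread by this sketch, so a real format would need an unambiguous separator.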

Alternatively, in another example, according to the alarm detection information and the abnormal alarms, the number of machines with abnormal alarms in each specified time period within the cycle is counted, together with the proportion of that number to the total number of monitored machines.

Specifically, if more detailed information is needed, the count can be based on different thresholds for the abnormal category, such as thresholds on the minimum, maximum, total, or average number of consecutive occurrences. The number of machines with abnormal alarms in each specified time period within the cycle is then counted, together with the proportion of that number to the total number of monitored machines.

Alternatively, in another example, according to the alarm detection information and the abnormal alarms, the following are counted: the machine types over which each abnormal category is distributed, the kernel versions of those machines, the product lines in which the abnormal category occurs, or the abnormal processes.

By aggregating abnormal alarms, one can obtain the type, product line, and kernel version of the machines raising abnormal alarms, or the abnormal alarms and their processes for each machine type, product line, and kernel version. This makes it easier to determine whether an abnormal alarm is caused by a version problem or by a problem with the machine itself.

Alternatively, in another example, according to the alarm detection information and the aggregated abnormal alarms, information on the K worst-performing machines is obtained, where K is an integer greater than 1.

Obtaining how each performance metric of a single machine changes makes it easier to capture the fault scene when the machine's performance is abnormal, without logging in to the machine to inspect it.
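Selecting the K worst-performing machines could be sketched as follows in Python. The text does not define how "worst" is measured; this sketch assumes it means the largest number of abnormal alarms in the cycle, and the "ip" key and function name are hypothetical.

```python
from collections import Counter


def worst_k_machines(abnormal_alarms, k):
    """Return the k machines with the most abnormal alarms as (ip, count).

    Ranking by alarm count is an assumption; another severity measure
    could be substituted without changing the overall shape.
    """
    if k <= 1:
        raise ValueError("K must be an integer greater than 1")
    counts = Counter(alarm["ip"] for alarm in abnormal_alarms)
    return counts.most_common(k)
```

For example, `worst_k_machines(alarms, 10)` would give the ten machines to inspect first.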

Step 206: Visualize the obtained abnormal alarms and/or analysis data according to the received display instruction.

The obtained abnormal alarms and analysis data are visualized. This embodiment uses the Grafana platform for visualization; it supports multiple data sources and many chart types and is convenient and flexible to operate.

The abnormal alarms and analysis data to be visualized are determined according to the input visualization instruction. For example, the data in Table 2 can be visualized, and a related curve chart can be drawn from the data in Table 2, as shown in FIG. 3.

The division of the above methods into steps is only for clarity of description. In implementation, steps may be merged into one, or a step may be split into several; as long as the same logical relationship is included, they fall within the protection scope of this patent. Adding insignificant modifications to the algorithm or process, or introducing insignificant designs into it, without changing its core design also falls within the protection scope of the present invention.

The third embodiment of the present invention relates to a method for monitoring machine performance. It provides another implementation of the step, in the above embodiments, of obtaining the resource operation data of each of at least two monitored machines within a preset cycle. Its flow is shown in FIG. 4:

Step 301: Obtain, from a summary database, the resource operation data of each of at least two monitored machines within a preset cycle; the summary database receives the resource operation data reported by each monitored machine.

Specifically, the monitoring platform is communicatively connected to the summary database, and each monitored machine is connected to the summary database, which aggregates the resource operation data from all machines. When the number of monitored machines is large, setting up a summary database and obtaining the resource operation data from it further reduces the amount of data to be processed.

This example describes in detail how a machine obtains its resource operation data. Each monitored machine performs the following processing: it obtains the resource data generated by each resource block of the machine at run time, where the resource data includes alarms; according to the analysis strategy corresponding to each resource block, it obtains the alarm detection information of the alarms in each resource block and adds label information to the alarms in each piece of resource data, where the label information indicates the category to which the alarm belongs; and it determines the machine's resource operation data from the alarm detection information and the resource data.

In one example, if the resource block is the CPU, adding label information to the alarms in each piece of resource data according to the analysis strategy corresponding to each resource block includes: if the alarm corresponding to the CPU is detected to reach the preset running-high threshold, adding to the alarm label information indicating that it belongs to a CPU running-high type, where the running-high types include the all-cores running-high category and the partial-cores running-high category; and/or obtaining the running-high thread of the CPU according to the alarm and determining whether it is a kernel thread. If the running-high thread is a kernel thread, it is determined whether the thread meets the threshold for performance analysis; if so, kernel analysis is performed, field information characterizing the analysis result is obtained, and that field information is used as the alarm's label information; if the threshold is not met, the alarm is discarded. If the running-high thread is a service thread, kernel analysis is performed, field information characterizing the analysis result is obtained, and that field information is used as the alarm's label information.

CPU sampling can be performed in a loop by an independent process, detecting processes that keep the CPU running high at a sampling interval of 1 s. How high each process is running is detected from CPU snapshots and TOP snapshots, which are collected at the same frequency so that the corresponding thread run data is captured while the CPU is running high. The alarm type is determined from the number of CPU cores running high: all cores running high (HIGH_CPU_ALL) or some cores running high (HIGH_CPU_PART). The thread type shows whether the running-high thread is a kernel thread (HIGH_KERNEL_THREAD). Performance analysis is performed on service processes and on kernel threads whose CPU usage reaches the perf threshold, and the results are reported to the summary database. Service processes include, for example, the web service process shark and the cache service process squid; kernel threads include kswapd, ksoftirqd, and so on. The CPU and thread detection results for the current cycle are summarized every 15 minutes. For example, if the partial-core label HIGH_CPU_PART_1_69_831_17 is detected and the running-high process kswapd_1_69_831_17 is further detected, some cores ran high in this cycle, specifically because of the kswapd thread. Here kswapd_1_69_831_17 indicates that the thread ran high for 831 s of the 15 minutes (900 s), which is a long duration. If it is still running high in the next cycle, an abnormality is more likely, and the analysis result is reported.
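The choice between HIGH_CPU_ALL and HIGH_CPU_PART from the number of cores running high could look like the following Python sketch. The per-core usage cutoff of 90% and the function name are assumptions; the text states only that a preset running-high threshold exists.

```python
# Hypothetical per-core cutoff; the text only says a threshold is preset.
HIGH_USAGE_PERCENT = 90.0


def classify_cpu_sample(per_core_usage):
    """Return HIGH_CPU_ALL, HIGH_CPU_PART, or None for one 1 s sample.

    per_core_usage holds one usage percentage per CPU core.
    """
    high_cores = [u for u in per_core_usage if u >= HIGH_USAGE_PERCENT]
    if not high_cores:
        return None
    if len(high_cores) == len(per_core_usage):
        return "HIGH_CPU_ALL"
    return "HIGH_CPU_PART"
```

Classifying each 1 s sample separately leaves the 15-minute summary free to count how many samples fell into each category.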

In another example, if the resource block is input/output (I/O), adding label information to the alarms in each piece of resource data according to the analysis strategy corresponding to each resource block includes: obtaining, according to the alarm, first data in the CPU that indicates the time spent waiting for I/O to complete; if the first data exceeds the I/O threshold, triggering I/O detection to find the disks whose second data exceeds the detection threshold and treating them as high disks, where the second data indicates the fraction of each second during which the disk's resources are busy; and processing each high disk as follows: if the request service time is detected to exceed the time threshold, adding to the alarm label information indicating abnormal service time; if the queue waiting time exceeds the waiting threshold, adding to the alarm label information indicating abnormal waiting time; and if the read/write volume per unit time is detected to exceed the preset read/write threshold, adding to the alarm label information indicating abnormal throughput.

Specifically, during CPU sampling the iowait value of each CPU is obtained, where iowait indicates the time spent waiting for I/O to complete. If the iowait value of any CPU exceeds the preset threshold, I/O detection is triggered to find the disks whose IOUTIL exceeds the threshold, where IOUTIL indicates the fraction of each second during which the disk's resources are busy. Each such disk is processed as follows: if the request service time is detected to exceed its threshold, the alarm factor is determined to be LONG_SERVICE_TIME; if the queue waiting time is detected to be too long, the alarm factor is LONG_WAIT_TIME; if the read/write volume per unit time is detected to be too large, the alarm factor is LARGE_THROUGHPUT. The summary data records, for IOWAIT within the cycle, the minimum and maximum numbers of consecutive running-high occurrences, the total number of running-high occurrences, and the average number of consecutive running-high occurrences.
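The two-stage I/O check (iowait as a trigger, then per-disk factors) is sketched below in Python. All threshold values, the `DiskStat` type, and its field names are hypothetical; only the three label names and the order of the checks come from the text.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the text only says each one is preset.
IOWAIT_THRESHOLD = 20.0       # percent of CPU time waiting on I/O
UTIL_THRESHOLD = 80.0         # percent of each second the disk is busy
SERVICE_MS_THRESHOLD = 50.0
WAIT_MS_THRESHOLD = 100.0
THROUGHPUT_MB_THRESHOLD = 200.0


@dataclass
class DiskStat:
    name: str
    util_percent: float
    service_ms: float
    wait_ms: float
    throughput_mb_s: float


def io_alarm_factors(iowait_per_cpu, disks):
    """Map each high disk to its alarm factors; empty if not triggered."""
    # Stage 1: any single CPU above the iowait threshold triggers detection.
    if not any(value > IOWAIT_THRESHOLD for value in iowait_per_cpu):
        return {}
    factors = {}
    # Stage 2: only disks above the utilization threshold are examined.
    for disk in disks:
        if disk.util_percent <= UTIL_THRESHOLD:
            continue
        labels = []
        if disk.service_ms > SERVICE_MS_THRESHOLD:
            labels.append("LONG_SERVICE_TIME")
        if disk.wait_ms > WAIT_MS_THRESHOLD:
            labels.append("LONG_WAIT_TIME")
        if disk.throughput_mb_s > THROUGHPUT_MB_THRESHOLD:
            labels.append("LARGE_THROUGHPUT")
        factors[disk.name] = labels
    return factors
```

A high disk can carry several factors at once, since the three conditions are checked independently.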

In another example, if the resource block is memory, adding label information to the alarms in each piece of resource data according to the analysis strategy corresponding to each resource block includes: determining the memory abnormality category of the machine from the obtained memory information; and determining whether that category is one of the specified categories and, if so, adding to the alarm label information indicating the specified category and obtaining, for the preset period, the number of consecutive occurrences of the memory abnormality alarm, the total number of memory running-high occurrences, and the average number of consecutive memory running-high occurrences. The specified categories include any combination of the following: a surge in the system's unreclaimable memory, a surge in the unreclaimable memory of the system's SLAB allocator, a sudden change in memory, a memory change rate exceeding the preset change-rate threshold, and SLAB allocator usage exceeding the preset usage threshold.

Memory information is obtained and used to determine the category of the alarm on the machine. If the category is a surge in the system's overall unreclaimable memory, the label "UNRECLAIM_SURGE" is added; if it is a surge in the system's SLAB unreclaimable memory, the label "SLAB_UNRECLAIM_SURGE" is added; if it is a sudden change in memory, the label "MEMORY_SUDDEN_CHANGE" is added; if the memory change rate exceeds the preset change-rate threshold, the label MEMORY_USAGE_HIGH is added; and if the SLAB allocator usage exceeds the preset usage threshold, the label SLAB_HIGH is added. The change-rate threshold and the usage threshold can be set in advance as needed. If at least one of these conditions holds, the minimum and maximum numbers of consecutive running-high occurrences, the total number of running-high occurrences, and the average number of consecutive running-high occurrences of that state within the 15-minute cycle are further determined.

In another example, if the resource block is the load (LOAD), label information is added to the alarm as follows. When the system load over the last minute reaches the preset threshold, or the CPU is running high but no thread has high usage, it is determined whether the alarm's category is one of the specified load categories; if so, label information indicating that category is added to the alarm. The load abnormality categories include any combination of the following: an abnormal thread, an average wait-queue length exceeding the preset waiting threshold, the number of threads in the running state exceeding the first thread threshold, and the number of threads in the D state exceeding the second thread threshold. If an abnormal thread exists, the label "ABNORMAL_THREAD" is added to the alarm; if the average wait queue exceeds the preset waiting threshold, the label "LONG_WAIT_QUEUE" is added; if the running-state threads exceed the first thread threshold, the label "MULTI_R_THREAD" is added; and if the D-state threads exceed the second thread threshold, the label "MULTI_D_THREAD" is added.
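The mapping from load conditions to labels could be sketched in Python as follows. The parameter names and the way thresholds are passed in are assumptions for illustration; the four label names are those given in the text.

```python
def load_labels(has_abnormal_thread, avg_wait_queue, running_threads,
                d_state_threads, *, wait_threshold, r_threshold, d_threshold):
    """Return the load labels that apply to one detection result.

    The thresholds are keyword-only so that callers must name them;
    their values are deployment-specific and not given in the text.
    """
    labels = []
    if has_abnormal_thread:
        labels.append("ABNORMAL_THREAD")
    if avg_wait_queue > wait_threshold:
        labels.append("LONG_WAIT_QUEUE")
    if running_threads > r_threshold:
        labels.append("MULTI_R_THREAD")
    if d_state_threads > d_threshold:
        labels.append("MULTI_D_THREAD")
    return labels
```

Because the categories may occur in any combination, the function returns a list rather than a single label.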

If at least one of these categories is present, the minimum and maximum numbers of consecutive running-high occurrences, the total number of running-high occurrences, and the average number of consecutive running-high occurrences of that state within the 15-minute cycle are further determined.
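One possible reading of these four statistics is sketched below in Python: each sample in the cycle is either in the abnormal state or not, a "consecutive occurrence" is a run of adjacent abnormal samples, and the total is the number of abnormal samples. This interpretation and all names are assumptions, not definitions from the text.

```python
def high_run_stats(samples):
    """Summarize runs of consecutive abnormal samples within one cycle.

    samples is a sequence of booleans, one per sampling interval.
    Returns None when the state never occurred in the cycle.
    """
    runs = []
    current = 0
    for is_high in samples:
        if is_high:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)  # a run that lasts until the cycle ends
    if not runs:
        return None
    return {
        "min": min(runs),
        "max": max(runs),
        "total": sum(runs),
        "avg": sum(runs) / len(runs),
    }
```

Under this reading, a large maximum with a small total suggests one sustained episode, while a large total with a small maximum suggests many brief ones.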

If the resource block is the traffic resource, the average number of connections per minute and the inbound and outbound traffic in the previous cycle are obtained, for example TRAFFIC_PER_MIN_20.44K|8682.6MB|13426.1MB.

It is worth noting that recording information such as how long an alarm lasts and how many consecutive times it occurs makes it possible to determine how long the alarm affects the whole system. The recorded durations and counts also show how each alarm persists when a machine is abnormal, which helps in analyzing the cause of the fault. After corrective action has been taken to eliminate a fault, the detected alarm durations and counts can be used to determine whether the fault has been resolved or improved. The recorded durations and counts can also be used to screen out sporadic problems, reducing the amount of data for fault analysis.

If the resource block is hardware, the machine is checked for hardware problems such as NUMA node memory imbalance (NUMA_TOTAL_UNBALANCE), bad memory pages (MEM_HARDWARE_ERROR), disk failure (DISK_ERROR), CPU overheating (HIGH_CPU_TMP), CPU offline (CPU_OFFLINE), incorrect CPU operating mode (LOW_CPU_PERFORMANCE), and CPU frequency reduction (LOW_CPU_FREQ), and label information of the corresponding category is added to the alarm. Because a hardware problem generally persists for a long time once it occurs, the number of consecutive occurrences need not be recorded; only whether the problem occurred is recorded.

The alarm factors and resource data of each machine are reported to the summary database as its resource operation data.

Step 302: Obtain the label information of each alarm in the resource operation data. The label information indicates the category to which the alarm belongs and is produced by each machine, which labels its alarms by category.

Step 303: According to the label information of each alarm, obtain the alarms that belong to an abnormal category from the resource operation data as abnormal alarms.

Step 304: Determine the analysis data of each abnormal category according to the abnormal alarms and their corresponding alarm detection information. The analysis data includes information on the machines on which the abnormal alarms of the abnormal category occurred.

Step 305: Visualize the obtained abnormal alarms and/or analysis data according to the received display instruction.

Step 306: Obtain the quality change data of the machine at the time an abnormal alarm is generated. The quality change data indicates how the machine's quality indicators change during operation.

Specifically, when an abnormal alarm occurs, the machine's quality indicators change. Associating the quality change data with the abnormal alarm makes it possible to observe the alarm's impact on the machine's operation. For example, if the category of the abnormal alarm is process running high, the abnormal alarm can be associated with the quality change data. Alternatively, when quality change data is detected, the abnormal alarms generated at that time can be obtained and associated with the quality change data.

The quality indicator data can come from server logs or client logs. This embodiment uses server-side TCP access logs collected in real time and calculates the quality indicator data of the different domain names on each machine at the same time granularity (the period length). Because different domain names may care about different quality indicators, several indicators can be calculated at once, or only the required ones can be selected. The quality indicator data includes, but is not limited to: retransmission ratio, transmission speed, >1M transmission speed (small files may transfer so quickly that the measured speed fluctuates widely, so an indicator restricted to large-file transfers is added), stall rate, first-packet time, first-screen time, and quality factor.
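As a rough illustration of the per-period calculation described above, the sketch below aggregates hypothetical access-log records into a retransmission ratio and a transmission speed restricted to large transfers. The record fields (`bytes_sent`, `duration_s`, `packets`, `retrans_packets`), the reading of ">1M" as one million bytes, and the formulas themselves are assumptions made for illustration; the patent does not specify the log schema or how each indicator is computed.

```python
from dataclasses import dataclass

# Assumed reading of ">1M"; the source does not state the unit.
LARGE_TRANSFER_BYTES = 1_000_000


@dataclass
class AccessRecord:
    """One hypothetical server-side TCP access-log entry."""
    domain: str
    bytes_sent: int
    duration_s: float
    packets: int
    retrans_packets: int


def quality_indicators(records):
    """Aggregate one period's records into per-domain quality indicators."""
    acc = {}
    for r in records:
        a = acc.setdefault(r.domain, {"packets": 0, "retrans": 0, "bytes": 0,
                                      "time": 0.0, "big_bytes": 0, "big_time": 0.0})
        a["packets"] += r.packets
        a["retrans"] += r.retrans_packets
        a["bytes"] += r.bytes_sent
        a["time"] += r.duration_s
        if r.bytes_sent > LARGE_TRANSFER_BYTES:
            a["big_bytes"] += r.bytes_sent
            a["big_time"] += r.duration_s

    result = {}
    for domain, a in acc.items():
        result[domain] = {
            "retrans_ratio": a["retrans"] / a["packets"] if a["packets"] else 0.0,
            "speed_bps": a["bytes"] / a["time"] if a["time"] else 0.0,
            # None when the period had no large transfers for this domain.
            "large_speed_bps": a["big_bytes"] / a["big_time"] if a["big_time"] else None,
        }
    return result
```

Keeping the large-transfer speed separate mirrors the reasoning in the text: very short transfers would otherwise dominate the variance of the overall speed figure.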

步骤307:关联质量变化数据和异常告警。Step 307: Associating the quality change data with the abnormality alarm.

Machine performance data showing some cores running high does not necessarily affect quality, so the association results can be used to filter out the entries with the greatest impact and optimize those.
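One way to picture the association and filtering step is a join on machine and time window. The sketch below is a minimal example under stated assumptions: the `AbnormalAlarm` and `QualityChange` record shapes, the use of epoch seconds, and the `min_abs_delta` filter are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class AbnormalAlarm:
    machine: str
    label: str   # e.g. "HIGH_CPU_PART"
    start: int   # start of the alarm window, epoch seconds (assumed)
    end: int     # end of the alarm window, epoch seconds (assumed)


@dataclass
class QualityChange:
    machine: str
    indicator: str   # e.g. "retrans_ratio"
    timestamp: int
    delta: float     # change relative to the previous period (assumed)


def associate(alarms, changes, min_abs_delta=0.0):
    """Pair each alarm with quality changes on the same machine during the alarm.

    Changes smaller than `min_abs_delta` are dropped, so alarms that did not
    visibly affect quality produce no pairs.
    """
    by_machine = {}
    for c in changes:
        by_machine.setdefault(c.machine, []).append(c)

    pairs = []
    for a in alarms:
        for c in by_machine.get(a.machine, []):
            if a.start <= c.timestamp <= a.end and abs(c.delta) >= min_abs_delta:
                pairs.append((a, c))
    return pairs
```

With a non-zero `min_abs_delta`, an alarm for partially high cores that coincides with no meaningful quality change simply drops out of the result.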

步骤308:可视化关联的质量变化数据和异常告警。Step 308: Visualize the associated quality change data and abnormal alarms.

In addition, based on the association results, the performance indicators and process information that have a greater impact on quality can be summarized, brought to the staff's attention, and added to monitoring, so that problems are optimized promptly and their impact is reduced.

The fourth embodiment of the present invention relates to a method for monitoring machine performance. The method is applied to the monitored machine, and its flow is shown in Figure 5:

步骤401:获取机器中每个资源块运行时产生的资源数据,资源数据包括告警。Step 401: Obtain resource data generated when each resource block in the machine is running, and the resource data includes alarms.

步骤402:根据每个资源块对应的分析策略,获取每个资源块中告警的告警检测信息以及为每个资源数据中告警添加标签信息,标签信息用于指示告警的告警所属类别。Step 402: According to the analysis strategy corresponding to each resource block, the alarm detection information of the alarm in each resource block is obtained and label information is added to the alarm in each resource data, where the label information is used to indicate the category to which the alarm belongs.

步骤403:根据告警检测信息以及资源数据,确定机器的资源运行数据,以供监控平台获取预设周期内资源运行数据,获取所述资源运行数据中每个告警的标签信息;根据每个所述告警的标签信息,从所述资源运行数据中获取属于异常类别的所述告警作为异常告警,根据所述异常告警对应的告警检测信息以及所述异常告警,确定每个异常类别的分析数据;根据接收的显示指令,对获取的所述异常告警和/或所述分析数据进行可视化;分析数据包括:异常类别的异常告警所在机器的信息。Step 403: Determine the resource operation data of the machine based on the alarm detection information and the resource data, so that the monitoring platform can obtain the resource operation data within a preset period, and obtain the label information of each alarm in the resource operation data; according to the label information of each alarm, obtain the alarm belonging to the abnormal category from the resource operation data as an abnormal alarm, and determine the analysis data of each abnormal category based on the alarm detection information corresponding to the abnormal alarm and the abnormal alarm; visualize the acquired abnormal alarm and/or the analysis data according to the received display instruction; the analysis data includes: information on the machine where the abnormal alarm of the abnormal category is located.

本实施例中,实时采集被监控的机器的资源数据,资源数据中包含告警,根据资源块对应的分析策略,对每个资源数据中的告警进行分析,为告警添加标签信息并获取该告警的告警检测信息,由于不是仅将获取的告警直接上报,而是分析获得每个告警对应的告警检测信息,以及为告警添加标签信息,有利于后续监控平台根据该告警定位故障,减少故障持续的时间。In this embodiment, resource data of the monitored machine is collected in real time. The resource data contains alarms. According to the analysis strategy corresponding to the resource block, the alarm in each resource data is analyzed, label information is added to the alarm, and the alarm detection information of the alarm is obtained. Since the obtained alarm is not simply reported directly, but the alarm detection information corresponding to each alarm is analyzed and label information is added to the alarm, it is beneficial for the subsequent monitoring platform to locate the fault according to the alarm and reduce the duration of the fault.

本发明第五实施方式涉及一种监控机器性能的方法。本实施方式是对第四实施方式中步骤402的详细介绍,由于每个资源块对应的分析策略不同,导致步骤402有不同的实施过程,下面将按照每个资源块的类别介绍步骤402的实施过程:The fifth embodiment of the present invention relates to a method for monitoring machine performance. This embodiment is a detailed introduction to step 402 in the fourth embodiment. Since each resource block corresponds to a different analysis strategy, step 402 has a different implementation process. The implementation process of step 402 will be introduced below according to the category of each resource block:

一、若资源块为CPU,根据每个资源块对应的分析策略,为每个资源数据中告警添加标签信息的流程如图6所示:1. If the resource block is a CPU, according to the analysis strategy corresponding to each resource block, the process of adding label information to the alarm in each resource data is shown in Figure 6:

Step S51: If it is detected that the alarm corresponding to the CPU reaches the preset high-running threshold, execute step S52 or step S53, or execute both step S52 and step S53.

Specifically, a high-running threshold for the CPU can be set in advance, and an independent process samples the CPU in a loop. For example, processes that keep the CPU running high are detected at a 1 s sampling interval: the high-running status of each process is determined from CPU snapshots and TOP snapshots, which are collected at the same frequency so that the corresponding thread data is captured while the CPU is running high. It is then checked whether the CPU value in the alarm exceeds the high-running threshold. If it does, further analysis can be performed; if it does not, the alarm can be discarded and other alarms analyzed.

Step S52: Add to the alarm label information indicating that the alarm belongs to a CPU high-running type. The high-running types include the all-cores-running-high category and the some-cores-running-high category.

Specifically, the current high-running type can be analyzed further. For example, the number of CPU cores running high determines whether the alarm belongs to all cores running high (HIGH_CPU_ALL) or to some cores running high (HIGH_CPU_PART). HIGH_CPU_PART is a preset type meaning that, in the sampled CPU snapshot, the number of CPUs whose usage exceeds the threshold is greater than 0 and less than the total number of CPUs. The four numbers that follow HIGH_CPU_PART describe the problems of this type triggered within the period: the shortest continuous high-running time, the longest continuous high-running time, the total continuous high-running time, and the average continuous high-running time per detection, all in seconds.
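The classification and the four duration figures described above can be sketched as follows. The function names are hypothetical, the 1 s sampling interval comes from the earlier example in the text, and treating HIGH_CPU_ALL as "every core above the threshold" is an assumption inferred from the stated definition of HIGH_CPU_PART.

```python
def classify_cpu_snapshot(per_core_usage, threshold):
    """Return "HIGH_CPU_ALL", "HIGH_CPU_PART", or None for one CPU snapshot."""
    high = sum(1 for u in per_core_usage if u > threshold)
    if high == 0:
        return None
    if high == len(per_core_usage):
        return "HIGH_CPU_ALL"   # assumed: all cores above the threshold
    return "HIGH_CPU_PART"      # more than 0 and fewer than all cores


def run_durations(sample_labels, label, interval_s=1):
    """Durations in seconds of each consecutive run of `label` in the samples."""
    runs, current = [], 0
    for s in sample_labels:
        if s == label:
            current += interval_s
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def duration_summary(runs):
    """(shortest, longest, total, average) duration, or None if nothing ran high."""
    if not runs:
        return None
    return min(runs), max(runs), sum(runs), sum(runs) / len(runs)
```

A period's samples would be classified one by one, and the summary tuple corresponds to the four numbers reported after the tag.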

步骤S53:根据告警获取CPU的跑高线程,判断跑高线程是否为内核线程,若跑高线程为内核线程,则执行步骤S54,若跑高线程为服务线程,则执行步骤S55。Step S53: Obtain the CPU's high-running thread according to the alarm, and determine whether the high-running thread is a kernel thread. If the high-running thread is a kernel thread, execute step S54; if the high-running thread is a service thread, execute step S55.

具体地,可以根据线程类型判断跑高线程是否为内核线程跑高(HIGH_KERNEL_THREAD)。Specifically, it can be determined according to the thread type whether the high-running thread is a kernel thread running high (HIGH_KERNEL_THREAD).

本示例中仅对于服务进程和CPU使用率触发perf阈值的内核线程做性能分析,并将分析结果上报到汇总数据库,服务进程可以是web服务进程shark、cache服务进程squid等;内核线程如kswapd、ksoftirqd等。In this example, performance analysis is performed only for service processes and kernel threads whose CPU usage triggers the perf threshold, and the analysis results are reported to the summary database. Service processes can be web service processes such as shark and cache service processes such as squid; kernel threads can be such as kswapd and ksoftirqd.

步骤S54:判断跑高线程是否满足性能分析的阈值,若是,则执行步骤S55,进行内核分析,并将分析结果作为告警因素,若不满足阈值,则执行步骤S56丢弃告警。Step S54: Determine whether the high-running thread meets the threshold of performance analysis. If so, execute step S55 to perform kernel analysis and use the analysis result as an alarm factor. If it does not meet the threshold, execute step S56 to discard the alarm.

步骤S55:进行内核分析,获取用于表征分析结果的字段信息,将字段信息作为告警的标签信息。Step S55: perform kernel analysis, obtain field information used to characterize the analysis result, and use the field information as label information of the alarm.

步骤S56:丢弃告警。Step S56: discard the alarm.
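The branch structure of steps S53 to S56 is small enough to express as a single function. This is only a sketch: `analyze` is a caller-supplied stand-in for the kernel analysis of step S55, and reading "meets the threshold" as "usage at or above `perf_threshold`" is an assumption, since the text does not say whether the comparison is strict.

```python
def handle_high_thread(is_kernel_thread, cpu_usage, perf_threshold, analyze):
    """Return tag fields for the alarm, or None if the alarm is discarded.

    `analyze` is a hypothetical callable standing in for the analysis of
    step S55; whatever it returns is used as the alarm's tag fields.
    """
    if is_kernel_thread and cpu_usage < perf_threshold:
        return None       # S54 -> S56: kernel thread below the analysis threshold
    return analyze()      # S55: service thread, or kernel thread meeting the threshold
```

Passing the analysis in as a callable keeps the decision logic testable without invoking a real profiler.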

二、若资源块为输入/输出I/O,根据每个资源块对应的分析策略,为每个资源数据中告警添加标签信息的流程如图7所示:2. If the resource block is an input/output I/O, according to the analysis strategy corresponding to each resource block, the process of adding label information to the alarm in each resource data is shown in Figure 7:

步骤S61:根据告警获取CPU中的用于指示等待I/O完成时间的第一数据。Step S61: acquiring first data indicating the waiting time for I/O completion in the CPU according to the alarm.

具体地,CPU采样中,获取CPU的iowait值,iowait用于指示等待I/O完成时间。Specifically, during CPU sampling, the iowait value of the CPU is obtained, and iowait is used to indicate the waiting time for I/O completion.

步骤S62:若第一数据大于I/O阈值,则触发对I/O进行检测,查找第二数据超过检测阈值的盘作为高盘,第二数据用于指示盘在每秒内资源的运行占比。Step S62: If the first data is greater than the I/O threshold, I/O detection is triggered, and the disk whose second data exceeds the detection threshold is found as a high disk. The second data is used to indicate the operating ratio of the disk resources per second.

具体地,任意一个CPU的iowait值大于预设定阈值即触发IO检测,查找IOUTIL超过阈值的盘,其中,IOUTIL指示该盘在每秒内资源的运行占比。Specifically, if the iowait value of any CPU is greater than a preset threshold, IO detection is triggered to find the disk whose IOUTIL exceeds the threshold, where IOUTIL indicates the operation ratio of the disk's resources per second.

步骤S63:针对每个高盘进行如下处理:若检测到请求服务时间大于时间阈值,则为告警添加指示服务时间异常的标签信息;若队列等待时间超过等待阈值,则为告警添加指示等待时间异常的标签信息;若检测到单位时间内读写量超过预设的读写阈值,则为告警添加指示吞吐量异常的标签信息。Step S63: Perform the following processing for each high disk: if it is detected that the request service time is greater than the time threshold, add label information indicating that the service time is abnormal to the alarm; if the queue waiting time exceeds the waiting threshold, add label information indicating that the waiting time is abnormal to the alarm; if it is detected that the read and write volume per unit time exceeds the preset read and write threshold, add label information indicating that the throughput is abnormal to the alarm.

The following processing is performed on each disk found: if the request service time is detected to be greater than its threshold, the alarm is determined to belong to the abnormal-service-time category and the label information "LONG_SERVICE_TIME" is added to it; if the queue waiting time is detected to be too long, the alarm is determined to belong to the abnormal-waiting-time category and the label information "LONG_WAIT_TIME" is added; if the read/write volume per unit time is detected to be too large, the alarm is determined to belong to the abnormal-throughput category and the label information "LARGE_THROUGHPUT" is added. In addition, if I/O detection is triggered, the minimum/maximum consecutive high-running count, the total high-running count, and the average consecutive high-running count of IOWAIT within the preset period in which the alarm was generated can be recorded in the summary data.
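Steps S61 to S63 amount to a two-stage filter followed by three independent checks. The sketch below assumes the statistics have already been sampled; the `DiskStats` and `IoThresholds` shapes and their units are hypothetical, and only the three tag names come from the text.

```python
from dataclasses import dataclass


@dataclass
class DiskStats:
    name: str
    util: float           # IOUTIL: share of each second the disk was busy, 0..1 (assumed scale)
    service_ms: float     # request service time
    wait_ms: float        # queue waiting time
    throughput_mb: float  # read + write volume per unit time


@dataclass
class IoThresholds:
    iowait: float
    util: float
    service_ms: float
    wait_ms: float
    throughput_mb: float


def tag_io(per_cpu_iowait, disks, t):
    """Return {disk name: [tags]} for busy disks, or {} if I/O detection is not triggered."""
    # S62: any single CPU exceeding the iowait threshold triggers detection.
    if not any(w > t.iowait for w in per_cpu_iowait):
        return {}
    tags = {}
    for d in disks:
        if d.util <= t.util:
            continue  # not a "high disk"
        labels = []
        if d.service_ms > t.service_ms:
            labels.append("LONG_SERVICE_TIME")
        if d.wait_ms > t.wait_ms:
            labels.append("LONG_WAIT_TIME")
        if d.throughput_mb > t.throughput_mb:
            labels.append("LARGE_THROUGHPUT")
        tags[d.name] = labels
    return tags
```

The three checks are not mutually exclusive, so a single disk can carry several tags at once.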

三、若资源块为内存,根据每个资源块对应的分析策略,为每个资源数据中告警添加标签信息的流程如图8所示:3. If the resource block is memory, according to the analysis strategy corresponding to each resource block, the process of adding label information to the alarm in each resource data is shown in Figure 8:

步骤S71:根据获取的内存信息,确定机器中内存的内存异常类别。Step S71: Determine the memory abnormality category of the memory in the machine according to the acquired memory information.

获取内存信息,根据内存信息判断机器中内存的内存异常类别。Obtain memory information, and determine the memory exception category of the memory in the machine according to the memory information.

步骤S72:判断内存异常类别是否属于指定类别,若是指定类别,则执行步骤S73,否则,执行步骤S75,丢弃该告警。Step S72: Determine whether the memory exception category belongs to the specified category. If it is the specified category, execute step S73; otherwise, execute step S75 and discard the alarm.

具体地,指定类别包括以下一种或多种的组合:系统总体不可释放内存突变(UNRECLAIM_SURGE)、系统SLAB不可释放内存突变(SLAB_UNRECLAIM_SURGE)、内存突变(MEMORY_SUDDEN_CHANGE)、内存使用率过高(MEMORY_USAGE_HIGH)、SLAB占用过高(SLAB_HIGH)。Specifically, the designated categories include one or more combinations of the following: system overall unreleasable memory surge (UNRECLAIM_SURGE), system SLAB unreleasable memory surge (SLAB_UNRECLAIM_SURGE), memory surge (MEMORY_SUDDEN_CHANGE), high memory usage (MEMORY_USAGE_HIGH), and high SLAB occupancy (SLAB_HIGH).

If the memory abnormality category is determined to be a specified category, step S73 is executed; otherwise, the alarm is discarded.

步骤S73:为告警添加指示指定类型的标签信息。Step S73: Add label information indicating a specified type to the alarm.

The label information indicates the category of the alarm. In this example, the HW field is the top-level label for hardware alarms; label information can also be MEMORY, CPU, IO, and so on. HW_PARAM carries the specific abnormality detected, such as LOW_CPU_FREQ, DISK_ERROR, or NUMA_TOTAL_UNBALANCE. The SW label lists the software-related problems that have occurred, such as HIGH_CPU_PART or MEMORY_SUDDEN_CHANGE, and SW_PARAM holds the specific values for the detected SW problems. SW_MORE has two forms: normally it gives the average per-minute traffic and number of connections within a period; if a CPU problem occurred and a high-running process was detected, it gives the same traffic and connection figures together with the high-running process information.

将处理后的资源运行数据上报汇总数据库。Report the processed resource operation data to the summary database.

通过资源运行数据中告警的标签信息,便于后续从汇总数据库中快速获取包含指定类型的资源运行数据,提高后续对资源运行数据的处理速度。The label information of the alarm in the resource operation data makes it easier to quickly obtain the resource operation data of the specified type from the summary database, thereby improving the subsequent processing speed of the resource operation data.

If the category is a surge in the system's overall unreleasable memory, the label information "UNRECLAIM_SURGE" is added; if it is a surge in the system's unreleasable SLAB memory, the label information "SLAB_UNRECLAIM_SURGE" is added; if it is a sudden memory change, the label information "MEMORY_SUDDEN_CHANGE" is added; if it is the category in which the memory change rate exceeds the preset change-rate threshold, the label information MEMORY_USAGE_HIGH is added; if the usage of the SLAB allocator exceeds the preset usage threshold, the label information SLAB_HIGH is added. The change-rate threshold and the usage threshold can be set in advance according to actual needs.

Specifically, the minimum/maximum consecutive high-running count, the total high-running count, and the average consecutive high-running count of the memory's state within the current 15-minute period can be detected.

In another example, if the resource block is load, label information is added to the alarm as follows: when the system load over the last minute reaches a preset threshold, or the CPU is running high but no thread shows high usage, it is determined whether the alarm falls into a specified load category; if so, label information indicating that category is added to the alarm. The abnormal load categories include any combination of the following: an abnormal thread exists, the average wait-queue duration exceeds the preset waiting threshold, the number of threads in the running state exceeds the first thread threshold, and the number of threads in the D state exceeds the second thread threshold. If an abnormal thread exists, the label information "ABNORMAL_THREAD" is added to the alarm; if the average wait-queue duration exceeds the preset waiting threshold, the label "LONG_WAIT_QUEUE" can be added; if the number of running-state threads exceeds the first thread threshold, the label information "MULTI_R_THREAD" is added; if the number of D-state threads exceeds the second thread threshold, the label information "MULTI_D_THREAD" is added.
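The load rules above map directly onto four independent checks. In the sketch below the parameter names are hypothetical, and the inputs are assumed to have been measured already; only the four tag names come from the text.

```python
def tag_load(has_abnormal_thread, avg_wait, running_threads, d_state_threads,
             wait_threshold, r_thread_threshold, d_thread_threshold):
    """Return the load tags that apply; an empty list means no specified category matched."""
    labels = []
    if has_abnormal_thread:
        labels.append("ABNORMAL_THREAD")
    if avg_wait > wait_threshold:
        labels.append("LONG_WAIT_QUEUE")
    if running_threads > r_thread_threshold:
        labels.append("MULTI_R_THREAD")
    if d_state_threads > d_thread_threshold:
        labels.append("MULTI_D_THREAD")
    return labels
```

Returning a list reflects the statement that the abnormal load categories may occur in any combination.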

步骤S74:获取在预设时段内内存异常类别告警的持续次数、内存的跑高的总数值以及内存的平均跑高持续次数。Step S74: Obtain the number of continuation of memory abnormality category alarms, the total value of memory highs, and the average number of continuation of memory highs within a preset period of time.

The minimum/maximum consecutive high-running count, the total high-running count, and the average consecutive high-running count of the memory when the alarm was generated within the 15-minute period can be further detected, and the detected information is used as the alarm detection information for the alarm.

It should be noted that if the resource block is a traffic resource, the average number of connections per minute and the inbound and outbound traffic in the previous period are obtained, for example TRAFFIC_PER_MIN_20.44K|8682.6MB|13426.1MB.
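A consumer of the summary data would need to split a traffic tag like the one above back into numbers. The sketch below parses only the exact shape shown in the example; the assumption that the first traffic figure is inbound and the second outbound follows the order of the wording but is not confirmed by the source, and the function and key names are hypothetical.

```python
import re

# Matches only the shape shown in the example tag: <conns>K|<MB>|<MB>.
_TRAFFIC_RE = re.compile(r"^TRAFFIC_PER_MIN_([\d.]+)K\|([\d.]+)MB\|([\d.]+)MB$")


def parse_traffic_tag(tag):
    """Split a TRAFFIC_PER_MIN tag into its three numeric fields."""
    m = _TRAFFIC_RE.match(tag)
    if m is None:
        raise ValueError(f"unrecognized traffic tag: {tag!r}")
    conns_k, first_mb, second_mb = (float(g) for g in m.groups())
    return {
        "connections_per_min_k": conns_k,  # thousands of connections per minute
        "traffic_in_mb": first_mb,         # assumed to be inbound
        "traffic_out_mb": second_mb,       # assumed to be outbound
    }
```

Tags with other unit suffixes would raise `ValueError` here, which is deliberate: the source shows only this one format.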

If the resource block is hardware, the machine is checked for hardware problems such as NUMA node memory imbalance (NUMA_TOTAL_UNBALANCE), bad memory pages (MEM_HARDWARE_ERROR), disk failure (DISK_ERROR), CPU overheating (HIGH_CPU_TMP), CPU offline (CPU_OFFLINE), wrong CPU operating mode (LOW_CPU_PERFORMANCE), and CPU frequency reduction (LOW_CPU_FREQ). Because a hardware problem generally persists for a long time once it occurs, there is no need to record the number of consecutive occurrences; only whether the problem occurred is recorded.

将每台机器的告警检测信息以及资源数据作为资源运行数据上报至汇总数据库。The alarm detection information and resource data of each machine are reported to the summary database as resource operation data.

步骤S75:丢弃该告警。Step S75: discard the alarm.

The sixth embodiment of the present invention relates to a system for monitoring machine performance. Its structural block diagram is shown in Figure 9, and it includes: a monitoring platform 61 for executing the above-mentioned method for monitoring machine performance, and a machine 62 for executing the above-mentioned method for monitoring machine performance.

There are at least two monitored machines, and there may be many more, for example more than 1,000.

监控平台可以是一个服务器,也可以是由多个服务器的组合,另外,该系统中还可以包括汇总数据库,被监控的机器与汇总数据库连接,汇总数据库与监控平台连接。The monitoring platform can be a server or a combination of multiple servers. In addition, the system can also include a summary database. The monitored machines are connected to the summary database, and the summary database is connected to the monitoring platform.

本发明第七实施方式涉及一种网络设备,其结构框图如图10所示,包括:至少一个处理器701;以及,与至少一个处理器701通信连接的存储器702;其中,存储器存储有可被至少一个处理器执行的指令,指令被至少一个处理器701执行,以使至少一个处理器701能够执行上述的监控机器性能的方法,或者,执行上述的监控机器性能的方法。The seventh embodiment of the present invention relates to a network device, a structural block diagram of which is shown in Figure 10, including: at least one processor 701; and a memory 702 communicatively connected to the at least one processor 701; wherein the memory stores instructions that can be executed by at least one processor, and the instructions are executed by at least one processor 701 so that at least one processor 701 can execute the above-mentioned method for monitoring machine performance, or execute the above-mentioned method for monitoring machine performance.

其中,存储器和处理器采用总线方式连接,总线可以包括任意数量的互联的总线和桥,总线将一个或多个处理器和存储器的各种电路链接在一起。总线还可以将诸如外围设备、稳压器和功率管理电路等之类的各种其他电路链接在一起,这些都是本领域所公知的,因此,本文不再对其进行进一步描述。总线接口在总线和收发机之间提供接口。收发机可以是一个元件,也可以是多个元件,比如多个接收器和发送器,提供用于在传输介质上与各种其他装置通信的单元。经处理器处理的数据通过天线在无线介质上进行传输,进一步,天线还接收数据并将数据传送给处理器。Among them, the memory and the processor are connected in a bus manner, and the bus may include any number of interconnected buses and bridges, and the bus links various circuits of one or more processors and memories together. The bus can also link various other circuits such as peripherals, voltage regulators, and power management circuits together, which are all well known in the art, so they are not further described in this article. The bus interface provides an interface between the bus and the transceiver. The transceiver can be one element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other devices on a transmission medium. The data processed by the processor is transmitted on a wireless medium through an antenna, and further, the antenna also receives data and transmits the data to the processor.

处理器负责管理总线和通常的处理,还可以提供各种功能,包括定时,外围接口,电压调节、电源管理以及其他控制功能。而存储器可以被用于存储处理器在执行操作时所使用的数据。The processor is responsible for managing the bus and general processing, and can also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. Memory can be used to store data used by the processor when performing operations.

本发明第八实施方式涉及一种计算机可读存储介质,存储有计算机程序,计算机程序被处理器执行时实现上述的监控机器性能的方法,或者,实现上述的监控机器性能的方法。The eighth embodiment of the present invention relates to a computer-readable storage medium storing a computer program, which implements the above-mentioned method for monitoring machine performance when executed by a processor, or implements the above-mentioned method for monitoring machine performance.

本领域技术人员可以理解实现上述实施例方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成,该程序存储在一个存储介质中,包括若干指令用以使得一个设备(可以是单片机,芯片等)或处理器(processor)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-OnlyMemory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。Those skilled in the art can understand that all or part of the steps in the above-mentioned embodiment method can be completed by instructing the relevant hardware through a program, and the program is stored in a storage medium, including several instructions to enable a device (which can be a single-chip microcomputer, chip, etc.) or a processor to execute all or part of the steps of the method described in each embodiment of the present application. The aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), disk or optical disk and other media that can store program codes.

本领域的普通技术人员可以理解,上述各实施方式是实现本发明的具体实施例,而在实际应用中,可以在形式上和细节上对其作各种改变,而不偏离本发明的精神和范围。Those skilled in the art will appreciate that the above-mentioned embodiments are specific examples for implementing the present invention, and in actual applications, various changes may be made thereto in form and detail without departing from the spirit and scope of the present invention.

Claims (11)

1. A method for monitoring machine performance, comprising:
obtaining resource operation data of each of at least two monitored machines within a preset period, the resource operation data including an alarm generated when each resource block in the machine is running and alarm detection information corresponding to the alarm, the alarm detection information including any combination of the following information: information on the duration of the alarm, information on the number of consecutive occurrences, and traffic information when the alarm is triggered;
acquiring label information of each alarm in the resource operation data, where the label information is used to indicate the category to which the alarm belongs and is obtained by each of the machines labeling the alarm according to its category;
according to the label information of each of the alarms, obtaining from the resource operation data the alarms belonging to an abnormal category as abnormal alarms;
determining analysis data for each of the abnormal categories according to the alarm detection information corresponding to the abnormal alarms and the abnormal alarms, the analysis data including: information on the machine where the abnormal alarm of the abnormal category is located;
visualizing the acquired abnormal alarms and/or the analysis data according to a received display instruction;
wherein determining the analysis data related to each abnormal category according to the alarm detection information corresponding to the abnormal alarms and the abnormal alarms includes:
according to each piece of the alarm detection information and the abnormal alarms, counting the duration of the abnormal alarms in each of the abnormal categories, the location information of the machines where the abnormal alarms are located, and the type distribution information of the machines where the abnormal alarms are located; or,
according to each piece of the alarm detection information and the abnormal alarms, obtaining a process that meets a preset analysis condition as a key process, and counting the duration of the abnormal alarms in the key process, the type of each machine where an abnormal alarm is located, and the location information of the machines where the abnormal alarms are located; or,
according to each piece of the alarm detection information and the abnormal alarms, counting the number of machines with the abnormal alarm in each specified time period within the period, and proportion information indicating the share of that number of machines in the total number of monitored machines; or,
according to each piece of the alarm detection information and the abnormal alarms, counting the machine types over which each abnormal category is distributed, the kernel version information of the machines over which the abnormal category is distributed, the product line information of the abnormal category, or the abnormal processes; or,
according to each piece of the alarm detection information and the abnormal alarms, obtaining, from the counted abnormal alarms, information on the K worst-performing machines, K being an integer greater than 1;
the abnormal categories include a traffic alarm category, a hardware alarm category, a processor-running-high category, and a process-running-high category.

2. The method for monitoring machine performance according to claim 1, characterized in that obtaining, according to the label information of each alarm, the alarms belonging to an abnormal category from the resource operation data as abnormal alarms comprises:
if the label information indicates that the alarm belongs to the traffic alarm category, obtaining as the abnormal alarm an alarm whose cumulative number of connections per minute exceeds a connection-number threshold, or an alarm whose inbound traffic per minute exceeds an inbound-traffic threshold, or an alarm whose outbound traffic per minute exceeds an outbound-traffic threshold;
if the label information indicates that the alarm belongs to the hardware alarm category, taking the alarm corresponding to the label information as the abnormal alarm;
if the label information indicates that the alarm belongs to the processor-running-high category, obtaining as the abnormal alarm an alarm whose number of high-running occurrences within the current period exceeds a high-running threshold;
if the label information indicates that the alarm belongs to the process-running-high category, obtaining as the abnormal alarm an alarm whose number of process high-running occurrences within the current period exceeds a process high-running threshold.
3. The method for monitoring machine performance according to claim 1, characterized in that the method further comprises:
obtaining quality change data of the machine at the time the abnormal alarm is generated, the quality change data indicating changes in quality indicator data during operation of the machine;
associating the quality change data with the abnormal alarm;
visualizing the associated quality change data and abnormal alarm.

4. The method for monitoring machine performance according to claim 1, characterized in that obtaining the respective resource operation data of at least two monitored machines within a preset period comprises:
obtaining the resource operation data of specified machines within the preset period from a summary database, the summary database being used to receive the resource operation data reported by each monitored machine.

5. A method for monitoring machine performance, characterized in that it is applied to a monitored machine and comprises:
acquiring resource data generated when each resource block in the machine runs, the resource data including alarms;
according to the analysis strategy corresponding to each resource block, obtaining alarm detection information for the alarms in each resource block and adding label information to the alarms in each piece of resource data, the label information indicating the category to which the alarm belongs;
determining the resource operation data of the machine according to the alarm detection information and the resource data, so that a monitoring platform can execute the method for monitoring machine performance according to any one of claims 1 to 4.

6. The method for monitoring machine performance according to claim 5, characterized in that, if the resource block is a CPU, adding label information to the alarms in each piece of resource data according to the analysis strategy corresponding to each resource block comprises:
if it is detected that the alarm corresponding to the CPU reaches a preset high-running threshold, adding to the alarm label information indicating that the alarm belongs to a CPU high-running type, the high-running type including a full-core high-running category or a partial high-running category; and/or,
obtaining the high-running thread of the CPU according to the alarm, and determining whether the high-running thread is a kernel thread; if the high-running thread is a kernel thread, determining whether the high-running thread meets the threshold for performance analysis; if so, performing kernel analysis, obtaining field information characterizing the analysis result, and using the field information as the label information of the alarm; if the threshold is not met, discarding the alarm; if the high-running thread is a service thread, performing the kernel analysis, obtaining field information characterizing the analysis result, and using the field information as the label information of the alarm.

7. The method for monitoring machine performance according to claim 5, characterized in that, if the resource block is input/output (I/O), adding label information to the alarms in each piece of resource data according to the analysis strategy corresponding to each resource block comprises:
obtaining, according to the alarm, first data in the CPU indicating the time spent waiting for I/O completion;
if the first data is greater than an I/O threshold, triggering I/O detection and identifying as a high disk each disk whose second data exceeds a detection threshold, the second data indicating the proportion of each second during which the disk's resources are in operation;
performing the following processing for each high disk: if it is detected that the request service time is greater than a time threshold, adding to the alarm label information indicating that the service time is abnormal; if the queue waiting time exceeds a waiting threshold, adding to the alarm label information indicating that the waiting time is abnormal; if it is detected that the read/write volume per unit time exceeds a preset read/write threshold, adding to the alarm label information indicating that the throughput is abnormal.

8. The method for monitoring machine performance according to claim 5, characterized in that, if the resource block is memory, adding label information to the alarms in each piece of resource data according to the analysis strategy corresponding to each resource block comprises:
determining the memory anomaly category in the machine according to acquired memory information;
determining whether the memory anomaly category belongs to a specified category; if so, adding to the alarm label information indicating the specified category, and obtaining, for a preset time period, the number of sustained occurrences of the memory anomaly category alarm, the total memory high-running value, and the average number of sustained memory high-running occurrences; the specified category includes any combination of the following: an abrupt change in non-releasable system memory, an abrupt change in non-releasable memory of the SLAB allocator in the system, an abrupt change in memory, a memory change rate exceeding a preset change rate threshold, and a usage rate of the SLAB allocator exceeding a preset usage rate threshold.

9. A system for monitoring machine performance, characterized in that it comprises: a monitoring platform for executing the method for monitoring machine performance according to any one of claims 1 to 4, and a machine for executing the method for monitoring machine performance according to any one of claims 5 to 8.

10. A network device, characterized in that it comprises:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the method for monitoring machine performance according to any one of claims 1 to 4, or execute the method for monitoring machine performance according to any one of claims 5 to 8.

11. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the method for monitoring machine performance according to any one of claims 1 to 4 is implemented, or the method for monitoring machine performance according to any one of claims 5 to 8 is implemented.
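As an illustration of the I/O analysis strategy recited in claim 7, the following Python sketch walks through the three stages: checking the CPU's I/O wait figure, finding high disks, and labelling each high disk. It is a hedged sketch under stated assumptions, not the patented implementation: the `DiskStats` type, its field names, the label strings, and every threshold constant are hypothetical, and the claims do not specify units or values.

```python
from dataclasses import dataclass


@dataclass
class DiskStats:
    name: str
    util_pct: float        # "second data": share of each second the disk is busy
    service_time_ms: float  # request service time
    queue_wait_ms: float    # queue waiting time
    rw_kb_per_s: float      # read/write volume per unit time


# Hypothetical thresholds; the claims only say "preset" thresholds exist.
IO_WAIT_THRESHOLD_PCT = 20.0     # I/O threshold for the "first data"
DISK_UTIL_THRESHOLD_PCT = 90.0   # detection threshold for high disks
SERVICE_TIME_THRESHOLD_MS = 20.0
QUEUE_WAIT_THRESHOLD_MS = 50.0
RW_THRESHOLD_KB_PER_S = 200_000.0


def label_io_alarm(iowait_pct: float, disks: list[DiskStats]) -> list[tuple[str, str]]:
    """Return (disk name, label) pairs to attach to an I/O alarm."""
    labels: list[tuple[str, str]] = []
    # Stage 1: only trigger I/O detection when the CPU's I/O wait is high.
    if iowait_pct <= IO_WAIT_THRESHOLD_PCT:
        return labels
    # Stage 2: high disks are those whose utilisation exceeds the threshold.
    high_disks = [d for d in disks if d.util_pct > DISK_UTIL_THRESHOLD_PCT]
    # Stage 3: each high disk may receive several independent labels.
    for d in high_disks:
        if d.service_time_ms > SERVICE_TIME_THRESHOLD_MS:
            labels.append((d.name, "service_time_abnormal"))
        if d.queue_wait_ms > QUEUE_WAIT_THRESHOLD_MS:
            labels.append((d.name, "wait_time_abnormal"))
        if d.rw_kb_per_s > RW_THRESHOLD_KB_PER_S:
            labels.append((d.name, "throughput_abnormal"))
    return labels
```

The three checks in the last stage are independent rather than mutually exclusive, which follows the claim wording: a single high disk can be labelled for service time, waiting time, and throughput at once.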
CN202110003255.3A 2021-01-04 2021-01-04 Method, system, network device and storage medium for monitoring machine performance Active CN112699007B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110003255.3A CN112699007B (en) 2021-01-04 2021-01-04 Method, system, network device and storage medium for monitoring machine performance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110003255.3A CN112699007B (en) 2021-01-04 2021-01-04 Method, system, network device and storage medium for monitoring machine performance

Publications (2)

Publication Number Publication Date
CN112699007A (en) 2021-04-23
CN112699007B (en) 2024-09-20

Family

ID=75514508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110003255.3A Active CN112699007B (en) 2021-01-04 2021-01-04 Method, system, network device and storage medium for monitoring machine performance

Country Status (1)

Country Link
CN (1) CN112699007B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113568822B (en) * 2021-08-03 2023-09-05 安天科技集团股份有限公司 Service resource monitoring method, device, computing equipment and storage medium
CN113608990B (en) * 2021-10-08 2022-02-01 上海豪承信息技术有限公司 Terminal performance detection method, device and storage medium
CN114157553B (en) * 2021-12-08 2024-06-18 深圳前海微众银行股份有限公司 Data processing method, device, equipment and storage medium
CN114661515B (en) * 2022-05-23 2022-09-20 武汉四通信息服务有限公司 Alarm information convergence method and device, electronic equipment and storage medium
CN115866511B (en) * 2022-11-18 2023-11-24 东土科技(宜昌)有限公司 Method and device for monitoring hardware equipment in positioning system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107658980A (en) * 2017-09-29 2018-02-02 国网浙江省电力公司 A kind of analysis method and system for being used to check power system monitor warning information

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11277420B2 (en) * 2017-02-24 2022-03-15 Ciena Corporation Systems and methods to detect abnormal behavior in networks
CN107196804B (en) * 2017-06-01 2020-07-10 国网山东省电力公司信息通信公司 Power system terminal communication access network alarm centralized monitoring system and method
CN112073208B (en) * 2019-05-25 2022-01-14 成都华为技术有限公司 Alarm analysis method, device, chip system and storage medium
CN110221936A (en) * 2019-06-12 2019-09-10 深圳前海微众银行股份有限公司 Database alert processing method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN112699007A (en) 2021-04-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant