CN102761448A

CN102761448A - Cluster monitoring and early warning method

Info

Publication number: CN102761448A
Application number: CN2012102776022A
Authority: CN
Inventors: 俞辉; 高传俊
Original assignee: China University of Petroleum East China
Current assignee: China University of Petroleum East China
Priority date: 2012-08-07
Filing date: 2012-08-07
Publication date: 2012-10-31

Abstract

The invention discloses a cluster monitoring and early warning method. It uses a grouping mechanism to adapt to clusters of different scales and to respond to large-scale clusters in real time, uses a star topology to handle single points of failure of Groups, and combines monitoring with early warning so that users can monitor the cluster in real time. The data collected by monitoring is analyzed in real time and compared with the system's performance indicators. Once a value is found to exceed the threshold of a performance indicator, the user is notified by SMS so that the fault can be resolved in time.

Description

Cluster monitoring and early warning method

Technical field:

The invention relates to a cluster monitoring and early warning method. In particular, it uses a grouping mechanism to adapt to clusters of different scales and to respond to large-scale clusters in real time, uses a topology structure to resolve single points of failure of Groups, and combines monitoring with early warning so that users can monitor the cluster in real time.

Background art:

Among traditional cluster monitoring systems, the open-source project Ganglia monitors clusters of up to 2000 nodes well. Ganglia is a scalable, cross-platform distributed monitoring system for high-performance computing systems. It is based on a hierarchical design and uses carefully designed data structures and algorithms to achieve low concurrency between nodes. However, Ganglia does not handle single points of failure: when a server fails, manual intervention is required. Moreover, with the rapid growth of the Internet in recent years, cluster sizes have far exceeded 2000 nodes, and Ganglia's monitoring cannot respond in a timely manner as the cluster grows. Current cluster monitoring technologies are each designed for one particular cluster platform and therefore lack generality. Traditional monitoring technologies also do not handle single points of failure and provide no early warning scheme.

Summary of the invention:

The technical problem to be solved by the invention is to overcome the defects of the prior art by providing a cluster monitoring and early warning method that adapts to clusters of different scales and responds to large-scale clusters in real time. The cluster size it can monitor far exceeds 2000 nodes, and it also solves the problem of single-point-of-failure handling.

The technical route adopted by the invention is as follows. First, the cluster is divided into N groups and the number of nodes per group is determined, so that each group consists of one Group and M Agents. Second, a star topology is used to solve the single-point-of-failure problem: a node called the ControlNode serves as the central node, and the Group and SecondaryGroup of every group are connected directly to the ControlNode, forming a star topology. The ControlNode records the mapping between each Group and its SecondaryGroup in real time. Once a Group fails, all Agents under that Group temporarily connect to the SecondaryGroup; when the fault is cleared, the Agents connect back to the Group. Finally, monitoring is combined with early warning: the data generated by monitoring is mined in real time and compared with the system performance indicators, and when a node is found to exceed a performance threshold, the designated user is notified by SMS or email. The method comprises the following process:

(1) Cluster grouping

The cluster is divided into N groups according to its scale, where clusterSize is the total number of nodes in the cluster:

$N = \begin{cases} \dfrac{\mathit{clusterSize}}{100}, & \mathit{clusterSize} > 100 \\ 1, & \mathit{clusterSize} \leq 100 \end{cases} \qquad (1)$

The number of nodes in each group, M, is then determined according to formula (2). Surplus nodes are distributed evenly among randomly chosen groups.

Each group consists of one server and M nodes. The server is called the Group. On each of the M nodes under it, an agent, called the Agent, is responsible for collecting information. The information collected by the Agent is divided into static information and dynamic information. Static information is hardware and software information that does not change over a given period (see Table 1); dynamic information is information that changes in real time (see Table 2).
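The grouping arithmetic can be sketched as follows. This is a minimal illustration, not the patented procedure: integer division in formula (1) and the even split with one surplus Agent per randomly chosen group are assumptions, since formula (2) is not reproduced in this text. The function names are hypothetical, and the figures 2000 and 1959 come from the embodiment described later in the document.

```python
import random


def group_count(cluster_size: int) -> int:
    """Formula (1): N = clusterSize / 100 above 100 nodes, otherwise 1."""
    # Integer division is an assumption; the source does not say how a
    # fractional quotient is rounded.
    return cluster_size // 100 if cluster_size > 100 else 1


def agents_per_group(agent_total: int, n_groups: int, rng: random.Random) -> list[int]:
    """Split Agents evenly; each surplus Agent goes to a randomly chosen group."""
    base, surplus = divmod(agent_total, n_groups)
    sizes = [base] * n_groups
    for idx in rng.sample(range(n_groups), surplus):
        sizes[idx] += 1
    return sizes


n = group_count(2000)
sizes = agents_per_group(1959, n, random.Random(0))
print(n, min(sizes), max(sizes), sum(sizes))  # prints: 20 97 98 1959
```

With 1959 Agents and 20 groups, every group receives 97 Agents and 19 randomly chosen groups receive one more.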

Table 1. Static information

Table 2. Dynamic information

Cluster grouping comprises the following steps:

① The Agent periodically hands its information to the Group for processing over the communication protocol;

② The Group classifies the information into real-time information and historical information, and further divides the historical information into 1-month and 3-month historical information;

③ The Group periodically writes this information to a designated database, so that users can monitor it in real time and the early warning method has a data basis;

④ The Group's response time to an Agent is generally 3 seconds, which broadly meets the real-time response requirements of most current cluster sizes.
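The classification in step ② can be illustrated with a small sketch that buckets a sample by its age. The source names only the categories, so the window lengths, the `classify` function and the bucket labels below are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical window lengths; the source does not define them.
REALTIME_WINDOW = timedelta(seconds=60)
ONE_MONTH = timedelta(days=30)
THREE_MONTHS = timedelta(days=90)


def classify(sample_time: datetime, now: datetime) -> str:
    """Return the storage bucket for a sample collected at sample_time."""
    age = now - sample_time
    if age <= REALTIME_WINDOW:
        return "realtime"
    if age <= ONE_MONTH:
        return "history_1m"
    if age <= THREE_MONTHS:
        return "history_3m"
    return "expired"


now = datetime(2012, 8, 7, 12, 0, 0)
print(classify(now - timedelta(seconds=10), now))  # prints: realtime
print(classify(now - timedelta(days=45), now))     # prints: history_3m
```

A Group process could apply such a function before writing each record to the database in step ③.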

(2) Solving the single point of failure

A Group is a single point of failure: when a Group fails, the Agents under it can no longer work. Using a redundancy mechanism and a star topology, a backup Group, called the SecondaryGroup, is provided for each Group. The SecondaryGroup has the same functions as the Group, but while no Agent is communicating with it, the SecondaryGroup runs only a listener thread that continuously checks for incoming Agent connections. As soon as an Agent connects, the SecondaryGroup starts its data processing function. Because switching flexibly between Group and SecondaryGroup requires a central node to coordinate it, a star topology is introduced. Its central node is a server called the ControlNode, and all Groups and SecondaryGroups are connected directly to the ControlNode, forming a star topology. Solving the single point of failure comprises the following steps:

① At startup, the Agent records the mapping between its Group and SecondaryGroup;

② The ControlNode records the mapping between each Group and its SecondaryGroup in real time;

③ When a Group fails, the Agent automatically detects that its current Group has failed, establishes communication with the SecondaryGroup, and hands the collected information to the SecondaryGroup for processing;

④ At the same time, the ControlNode marks the mapping Group-->SecondaryGroup to indicate that the Group has failed and needs manual recovery;

⑤ When the Group recovers, the ControlNode clears the mark on the mapping, notifies the SecondaryGroup to suspend processing of the information collected by the Agent, and, through the SecondaryGroup, informs the Agent that the Group's fault has been resolved;

⑥ After receiving this instruction, the Agent re-establishes communication with the Group, and the Group's single point of failure is resolved.
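The six steps above can be sketched as a small in-memory state model. It illustrates only the switching logic: there is no networking or failure detection, and the class layout, method names and node names (`group-1`, `secondary-1`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ControlNode:
    # Step 2: Group -> SecondaryGroup mapping, plus the Groups marked as failed.
    mapping: dict[str, str]
    failed: set[str] = field(default_factory=set)

    def mark_failed(self, group: str) -> None:
        self.failed.add(group)        # step 4: flag the mapping for manual recovery

    def mark_recovered(self, group: str) -> str:
        self.failed.discard(group)    # step 5: clear the flag
        return self.mapping[group]    # the SecondaryGroup that must be notified


@dataclass
class Agent:
    group: str
    secondary: str                    # step 1: mapping recorded at startup
    target: str = ""

    def __post_init__(self) -> None:
        self.target = self.group      # report to the Group by default

    def on_group_down(self) -> None:
        self.target = self.secondary  # step 3: switch to the SecondaryGroup

    def on_group_restored(self) -> None:
        self.target = self.group      # step 6: reconnect to the Group


control = ControlNode({"group-1": "secondary-1"})
agent = Agent("group-1", "secondary-1")

agent.on_group_down()
control.mark_failed("group-1")
print(agent.target, sorted(control.failed))  # prints: secondary-1 ['group-1']

notified = control.mark_recovered("group-1")
agent.on_group_restored()
print(notified, agent.target, sorted(control.failed))  # prints: secondary-1 group-1 []
```

Keeping the mapping on both the Agent and the ControlNode mirrors steps ① and ②: the Agent can switch over without asking the ControlNode, while the ControlNode keeps the authoritative record of which Groups await recovery.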

(3) Combining monitoring with early warning

The real-time and historical information collected by the Agents is mined to assess the performance of each node: whether CPU idle time is below 2%, whether memory usage exceeds 80% to 99%, whether disk I/O is too frequent, and whether network communication is abnormal. This achieves early warning and proceeds as follows:

① The Group periodically stores the data collected by the Agents in a designated database, divided into 1-month historical information, 3-month historical information and real-time information;

② The real-time information is analyzed. If a node's information has not been updated for a long time, that node can be judged to have failed;

③ The historical information is mined. Processes, CPU, memory, disk I/O and network traffic in the 1-month and 3-month historical information are analyzed against the following indicators:

a. Process information includes the number of running processes over 1 minute, 5 minutes and 15 minutes. If too many processes run during some period, the CPU information is checked;

b. CPU information includes user time, nice time, system time, I/O time and idle time. If CPU idle time is below 2%, the tasks running on the node exceed the load it can bear; if the number of processes is not high, the CPU itself may be the bottleneck. The early warning method sends the result to the designated user by SMS and email;

c. Memory information includes total memory, used memory and free memory. Memory usage is calculated for each time period; if usage exceeds 80% to 99%, the node is clearly short of memory, and the user is notified that its memory needs to be expanded;

d. Disk I/O information includes the number of I/O operations per second, read speed and write speed. If the disk performs too many I/O operations during some period, the node's disk is being read and written too frequently and has reached its bottleneck, and the user is notified to reduce the node's workload or replace the hardware with better equipment;

e. Network traffic includes the IP packet receive rate, IP reply packet rate, IP request packet rate, TCP segment receive rate, TCP segment send rate, TCP segment retransmission rate, UDP packet receive rate and UDP packet send rate. Analyzing these data shows whether recent network communication has been normal. If the packet loss rate is too high, the network must be abnormal, and the user is notified to inspect switches and other hardware as a preventive measure.
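The indicator checks a to e can be sketched as a function that turns one metrics sample into a list of warnings. Only the 2% CPU idle figure and the 80% lower bound for memory come from the text; the process, disk and packet-loss limits are placeholders, and the `Sample` type and `warnings_for` function are hypothetical. Delivery by SMS or email is left out.

```python
from dataclasses import dataclass

CPU_IDLE_MIN = 2.0      # percent, stated in the source
MEM_USED_MAX = 80.0     # percent, lower bound of the 80% to 99% range in the source
# The source gives no numbers for the next three limits; these are placeholders.
PROC_MAX = 500
DISK_IOPS_MAX = 1000.0
PACKET_LOSS_MAX = 5.0   # percent


@dataclass
class Sample:
    running_procs: int
    cpu_idle_pct: float
    mem_total: float
    mem_used: float
    disk_iops: float
    packet_loss_pct: float


def warnings_for(s: Sample) -> list[str]:
    """Compare one sample with the thresholds and collect warning messages."""
    out = []
    if s.cpu_idle_pct < CPU_IDLE_MIN:
        if s.running_procs > PROC_MAX:
            out.append("cpu: node overloaded by running tasks")
        else:
            out.append("cpu: possible CPU bottleneck")
    if 100.0 * s.mem_used / s.mem_total > MEM_USED_MAX:
        out.append("memory: consider expanding memory")
    if s.disk_iops > DISK_IOPS_MAX:
        out.append("disk: reduce load or upgrade hardware")
    if s.packet_loss_pct > PACKET_LOSS_MAX:
        out.append("network: check switches and other hardware")
    return out


sample = Sample(running_procs=10, cpu_idle_pct=1.0, mem_total=100.0,
                mem_used=90.0, disk_iops=10.0, packet_loss_pct=0.0)
print(warnings_for(sample))
# prints: ['cpu: possible CPU bottleneck', 'memory: consider expanding memory']
```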

The beneficial effects of the invention are as follows. The cluster is first divided into N groups and the number of nodes per group is determined; a star topology then solves the single-point-of-failure problem; finally, monitoring is combined with early warning, so that the data generated by monitoring is mined in real time and compared with the system performance indicators, problems are found, and users are notified promptly. Compared with traditional cluster monitoring methods, the invention is applicable to clusters of all sizes and remains timely and stable as the cluster grows. Technically, the invention abandons the traditional hierarchical approach in favor of a purpose-designed grouping approach, which makes it more scientific and practical. By combining the star topology with the early warning method, the invention also gains stability and completeness.

Description of drawings:

Fig. 1 is a topology diagram of the invention.

Fig. 2 is a structural diagram of the grouping scheme.

Fig. 3 is a structural diagram of the single-point-of-failure solution.

Fig. 4 is a flowchart of the early warning method.

Detailed description:

The invention is further described below with an example from an oilfield research institute that operates a 2000-node cluster.

(1) Cluster grouping

According to formula (1), the cluster is divided into N = 2000/100 = 20 groups, and the number of nodes in each group is then obtained from formula (2). Surplus nodes are distributed evenly among randomly chosen groups. Each group has one server, called the Group, so the cluster contains 20 Groups and 20 SecondaryGroups. One of the remaining nodes serves as the ControlNode, and on all other nodes an Agent collects the static information in Table 1 and the dynamic information in Table 2.

(2) Environment setup

Applying the grouping method above yields the following environment:

ControlNode (1): cp2001, IP address: 168.173.2.1

Group (20): cp2002~cp2021, IP addresses: 168.173.2.2~168.173.2.21

SecondaryGroup (20): cp2022~cp2041, IP addresses: 168.173.2.22~168.173.2.41

The remaining 1959 nodes serve as Agents.
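The node layout above can be reproduced with a few lines of arithmetic. In this sketch the helper names `host` and `ip` are hypothetical. The text confirms only the pairing of cp2002 with cp2022, so pairing the remaining Groups and SecondaryGroups in order is an assumption. The address rule is inferred from the listed ranges and is not stated for Agent nodes.

```python
def host(n: int) -> str:
    """Host name for node number n, following the cpNNNN pattern in the text."""
    return f"cp{n}"


def ip(n: int) -> str:
    # Inferred from the listed ranges: cpNNNN maps to 168.173.2.(NNNN - 2000).
    # Only meaningful for the ControlNode, Groups and SecondaryGroups listed.
    return f"168.173.2.{n - 2000}"


CLUSTER_SIZE = 2000
control_node = 2001
groups = list(range(2002, 2022))            # cp2002 to cp2021
secondary_groups = list(range(2022, 2042))  # cp2022 to cp2041

# Assumption: the i-th Group is backed by the i-th SecondaryGroup.
mapping = {host(g): host(s) for g, s in zip(groups, secondary_groups)}

agent_total = CLUSTER_SIZE - 1 - len(groups) - len(secondary_groups)

print(host(control_node), ip(control_node))      # prints: cp2001 168.173.2.1
print(mapping["cp2002"], ip(2041), agent_total)  # prints: cp2022 168.173.2.41 1959
```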

The mapping between Groups and Agents is shown in Table 3.

Table 3. Group and Agent mapping

In this environment the steps are as follows, taking the first group in Table 3 as an example, where the Group is node cp2002 and the Agents are nodes cp2042~cp2138:

① cp2042~cp2138 periodically hand their information to cp2002 for processing over the communication protocol;

② cp2002 classifies the information into real-time information and historical information, and further divides the historical information into 1-month and 3-month historical information;

③ cp2002 periodically writes this information to the designated database, so that users can monitor it in real time and the early warning method has a data basis;

④ The response time of cp2002 to each of cp2042~cp2138 is 3 seconds.

(3) Solving the single point of failure

The first group again serves as the example: the ControlNode is cp2001, the Group is node cp2002, the SecondaryGroup is cp2022, and the Agents are nodes cp2042~cp2138.

The ControlNode connects the 20 Groups above into a star topology, that is, cp2002→cp2003→……→cp2021→cp2002. If cp2002 fails, cp2042~cp2138 automatically connect to cp2022, the SecondaryGroup corresponding to cp2002. Once cp2002 is repaired, nodes cp2042~cp2138 automatically connect back to cp2002. The steps are:

① At startup, cp2042~cp2138 record the mapping between cp2002 and cp2022;

② cp2001 records the mapping between cp2002 and cp2022 in real time;

③ When cp2002 fails, cp2042~cp2138 automatically detect the failure, establish communication with cp2022, and hand the collected information to cp2022 for processing;

④ At the same time, cp2001 marks the mapping Group-->SecondaryGroup to indicate that cp2002 has failed and needs manual recovery;

⑤ When cp2002 recovers, cp2001 clears the mark on the mapping, notifies cp2022 to suspend processing of the information collected by cp2042~cp2138, and, through cp2022, informs cp2042~cp2138 that the fault of cp2002 has been resolved;

⑥ After receiving this instruction, cp2042~cp2138 re-establish communication with cp2002, and the single point of failure of cp2002 is resolved.

(4) Combining monitoring with early warning

The real-time and historical information collected by the Agents is mined to assess the performance of each node and check whether it exceeds a performance threshold, which achieves early warning. The steps are as follows, taking node cp2042 as an example:

1) The real-time information is analyzed. If the information of cp2042 has not been updated for a long time, cp2042 can be judged to have failed;

2) The historical information is mined. Processes, CPU, memory, disk I/O and network traffic in the 1-month and 3-month historical information are analyzed against the following indicators:

a. If cp2042 is running too many processes, the CPU information is checked;

b. If CPU idle time is below 2%, the tasks running on cp2042 exceed the load it can bear; if the number of processes is not high, the CPU itself may be the bottleneck. The early warning method sends the result to the designated user by SMS and email;

c. Memory usage is calculated for each time period. If usage exceeds 80% to 99%, cp2042 is clearly short of memory, and the user is notified that its memory needs to be expanded;

d. The number of disk I/O operations is checked. If cp2042 performs too many I/O operations, its disk is being read and written too frequently and has reached its bottleneck, and the user is notified to reduce the workload of cp2042 or replace the hardware with better equipment;

e. The network data is analyzed to determine whether recent network communication has been normal. If the packet loss rate is too high, the network must be abnormal, and the user is notified to inspect switches and other hardware as a preventive measure.
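Step 1) can be sketched as a staleness check over the last update time recorded for each node. The source says only "a long time", so the five-minute limit is a placeholder, and the `stale_nodes` function and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Placeholder: the source does not quantify "a long time".
STALE_AFTER = timedelta(minutes=5)


def stale_nodes(last_update: dict[str, datetime], now: datetime) -> list[str]:
    """Return the nodes whose latest real-time record is older than STALE_AFTER."""
    return sorted(node for node, seen in last_update.items()
                  if now - seen > STALE_AFTER)


now = datetime(2012, 8, 7, 12, 0, 0)
last_update = {
    "cp2042": now - timedelta(minutes=30),  # no report for half an hour
    "cp2043": now - timedelta(seconds=3),   # reported three seconds ago
}
print(stale_nodes(last_update, now))  # prints: ['cp2042']
```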

In long-term testing, the invention used 0% to 2% of the CPU and about 80 MB of memory.

Claims

1. A cluster monitoring and early warning method, which uses a grouping mechanism to adapt to clusters of different scales and to respond to large-scale clusters in real time, uses a topology structure to solve the single point of failure of the Group, and combines monitoring with early warning to monitor the cluster in real time, characterized in that it comprises the following process:

(1) Cluster grouping

The cluster is divided into N groups according to the scale of the cluster,

where clusterSize is the total number of nodes in the cluster, and the number of nodes in each group is then determined.

Surplus nodes are distributed evenly among randomly chosen groups. Each group has one server, called the Group; on all nodes under it an agent, called the Agent, is responsible for collecting information. The information collected by the Agent is divided into static information and dynamic information. Cluster grouping comprises the following steps:

1. The Agent periodically hands its information to the Group for processing over the communication protocol;

2. The Group classifies the information into real-time information and historical information, and further divides the historical information into 1-month and 3-month historical information;

3. The Group periodically writes this information to a designated database, so that users can monitor it in real time and the early warning method has a data basis;

4. The Group's response time to an Agent is generally 3 seconds, which broadly meets the real-time response requirements of most current cluster sizes;

(2) Solving the single point of failure

A Group is a single point of failure: when a Group fails, the Agents under it cannot work. Using a redundancy mechanism and a star topology, a backup Group, called the SecondaryGroup, is provided for each Group. The SecondaryGroup has the same functions as the Group, but while no Agent is communicating with it, the SecondaryGroup runs only a listener thread that continuously checks for incoming Agent connections; as soon as an Agent connects, the SecondaryGroup starts its data processing function. Because switching flexibly between Group and SecondaryGroup requires a central node to coordinate it, a star topology is introduced. Its central node is a server called the ControlNode; all Groups and SecondaryGroups are connected directly to the ControlNode, forming a star topology. Solving the single point of failure comprises the following steps:

1. At startup, the Agent records the mapping between its Group and SecondaryGroup;

2. The ControlNode records the mapping between each Group and its SecondaryGroup in real time;

3. When a Group fails, the Agent automatically detects that its current Group has failed, establishes communication with the SecondaryGroup, and hands the collected information to the SecondaryGroup for processing;

4. At the same time, the ControlNode marks the mapping Group-->SecondaryGroup to indicate that the Group has failed and needs manual recovery;

5. When the Group recovers, the ControlNode clears the mark on the mapping, notifies the SecondaryGroup to suspend processing of the information collected by the Agent, and, through the SecondaryGroup, informs the Agent that the Group's fault has been resolved;

6. After receiving this instruction, the Agent re-establishes communication with the Group, and the Group's single point of failure is resolved;

(3) Combining monitoring with early warning

The real-time and historical information collected by the Agents is mined to assess the performance of each node: whether CPU idle time is below 2%, whether memory usage exceeds 80% to 99%, whether disk I/O is too frequent, and whether network communication is abnormal. This achieves early warning and proceeds as follows:

1. The Group periodically stores the data collected by the Agents in the designated database as 1-month historical information, 3-month historical information and real-time information;

2. The real-time information is analyzed. If a node's information has not been updated for a long time, that node can be judged to have failed;

3. The historical information is mined. Processes, CPU, memory, disk I/O and network traffic in the 1-month and 3-month historical information are analyzed against the following indicators:

a. Process information includes the number of running processes over 1 minute, 5 minutes and 15 minutes. If too many processes are running, the CPU information is checked;

b. CPU information includes user time, nice time, system time, I/O time and idle time. If CPU idle time is below 2%, the tasks running on the node exceed the load it can bear; if the number of processes is not high, the CPU itself may be the bottleneck. The early warning method sends the result to the designated user by SMS and email;

c. Memory information includes total memory, used memory and free memory. Memory usage is calculated for each time period; if usage exceeds 80% to 99%, the node is clearly short of memory, and the user is notified that its memory needs to be expanded;

d. Disk I/O information includes the number of I/O operations per second, read speed and write speed. If the disk performs too many I/O operations, the node's disk is being read and written too frequently and has reached its bottleneck, and the user is notified to reduce the node's workload or replace the hardware with better equipment;

e. Network traffic includes the IP packet receive rate, IP reply packet rate, IP request packet rate, TCP segment receive rate, TCP segment send rate, TCP segment retransmission rate, UDP packet receive rate and UDP packet send rate. Analyzing these data shows whether recent network communication has been normal. If the packet loss rate is too high, the network must be abnormal, and the user is notified to inspect the switches as a preventive measure.