
CN102497292A - Method and system for monitoring computer cluster - Google Patents

Method and system for monitoring computer cluster

Info

Publication number
CN102497292A
Authority
CN
China
Prior art keywords
monitoring
node
monitored
module
monitored node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201110391562XA
Other languages
Chinese (zh)
Inventor
卢威
白利达
陈岚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Microelectronics of CAS
Original Assignee
Institute of Microelectronics of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Microelectronics of CAS
Priority to CN201110391562XA
Publication of CN102497292A
Legal status: Pending

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

An embodiment of the present invention proposes a method for monitoring a computer cluster, comprising the following steps: a monitored node collects operating information and sends its current load status and the monitored content information to a parameter adjustment module and a main monitoring module respectively; the main monitoring module receives the node's current load status and the monitored content information and stores both in a database; the parameter adjustment module analyzes the node's current load status and, when the load status reaches a preset threshold, adjusts the monitoring strategy of the monitored node and notifies the monitored node of the updated strategy. By tailoring the monitored content to the load status of the system, the method can effectively limit the resources consumed by the monitoring program when the system is running under high load, and makes cluster monitoring status and alarm information quick and convenient to obtain.

Figure 201110391562

Description

Method and system for computer cluster monitoring

Technical Field

The present invention relates to the field of computer communication, and in particular to a method and system for monitoring a computer cluster.

Background

A computer cluster, or cluster for short, is a computer system in which a group of loosely integrated computers, connected through software and/or hardware, cooperate closely to carry out computing work. In a sense, the cluster can be regarded as a single computer. The individual computers in a cluster are usually called nodes and are usually connected by a local area network, although other connection methods are possible. Clusters are typically used to improve on the computing speed and/or reliability of a single computer. In general, a cluster offers a much better price/performance ratio than a single computer such as a workstation or supercomputer.

Cluster applications are important for meeting today's growing computing demands: they can effectively reduce computation time and make full use of server hardware resources. System administrators need timely knowledge of the cluster's current running status and resource usage, so the cluster must be monitored in real time.

Some mature web-based cluster monitoring products already exist, but they have two main problems: first, the monitored content is fixed and cannot be customized; second, there is a conflict between the timeliness and completeness of monitoring on the one hand and computing performance on the other.

Therefore, an effective technical solution is needed to solve the problems of computer cluster monitoring in existing web-based approaches.

Summary of the Invention

The purpose of the present invention is to solve at least one of the above technical deficiencies, in particular to optimize the monitoring performance of the system by adjusting the monitoring strategy of the monitored nodes.

An embodiment of the present invention proposes a method for monitoring a computer cluster, comprising the following steps:

A monitored node collects operating information and sends the node's current load status and the monitored content information to a parameter adjustment module and a main monitoring module respectively;

The main monitoring module receives the node's current load status and the monitored content information, and stores the node's load status and the monitored content information in a database;

The parameter adjustment module analyzes the node's current load status and, when the load status reaches a preset threshold, adjusts the monitoring strategy of the monitored node and notifies the monitored node of the updated monitoring strategy.

The solution proposed by the present invention tailors the monitored content to the load status of the system, can effectively limit the resources consumed by the monitoring program when the system is running under high load, and makes cluster monitoring status and alarm information quick and convenient to obtain. In addition, the solution requires only minor changes to existing systems, does not affect system compatibility, and is simple and efficient to implement.

Additional aspects and advantages of the invention will be set forth in part in the description that follows; they will become apparent from that description or may be learned through practice of the invention.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:

Fig. 1 is a flowchart of the computer cluster monitoring method according to an embodiment of the present invention;

Fig. 2 is a structural diagram of the computer cluster monitoring system according to an embodiment of the present invention.

Detailed Description

Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it.

To achieve the purpose of the present invention, an embodiment of the present invention proposes a method for monitoring a computer cluster, comprising the following steps:

A monitored node collects operating information and sends the node's current load status and the monitored content information to a parameter adjustment module and a main monitoring module respectively;

The main monitoring module receives the node's current load status and the monitored content information, and stores the node's load status and the monitored content information in a database;

The parameter adjustment module analyzes the node's current load status and, when the load status reaches a preset threshold, adjusts the monitoring strategy of the monitored node and notifies the monitored node of the updated monitoring strategy.

Fig. 1 is a flowchart of the computer cluster monitoring method according to an embodiment of the present invention. The method comprises the following steps:

S110: The monitored node collects operating information and sends the node's current load status and the monitored content information to the parameter adjustment module and the main monitoring module respectively.

In step S110, each monitored node in the cluster first runs information collection and then sends its current load status and the monitored content to the parameter adjustment module and to the main monitoring module on the cluster management node.
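Step S110 can be illustrated with a minimal Python sketch. The patent does not specify a message format or transport, so the JSON report layout, the function name `collect_and_send` and the idea of passing the two receiving modules as plain callables are all assumptions made for illustration.

```python
import json
import time
from typing import Callable, Dict, List

# A "sink" stands in for the parameter adjustment module or the main
# monitoring module; the real transport is not described in the source.
Sink = Callable[[str], None]


def collect_and_send(node_id: str,
                     read_load: Callable[[], Dict[str, float]],
                     read_content: Callable[[], Dict[str, object]],
                     sinks: List[Sink]) -> str:
    """Build one report and deliver the same message to every sink."""
    report = {
        "node": node_id,
        "timestamp": time.time(),
        "load": read_load(),        # current load status of the node
        "content": read_content(),  # monitored content information
    }
    message = json.dumps(report, sort_keys=True)
    for send in sinks:
        send(message)
    return message


# Usage: two lists play the role of the two receiving modules.
to_adjuster, to_main = [], []
collect_and_send("node-1",
                 lambda: {"cpu": 0.5, "memory": 0.4},
                 lambda: {"disk_free_gb": 10},
                 [to_adjuster.append, to_main.append])
```

Passing the readers and sinks in as callables keeps the sketch independent of any particular operating system interface or network library.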

S120: The main monitoring module receives the node's current load status and the monitored content information, and stores the node's load status and the monitored content information in the database.

The main monitoring module analyzes and processes the information it receives and stores each node's status and monitored content in the database. In addition, the main monitoring module may provide a web service so that the monitoring results can be viewed through a web page.
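The storage part of step S120 might look like the following sketch, which uses Python's built-in `sqlite3` module. The patent names no database product or schema; the `reports` table, its columns, and the JSON report layout (`node`, `timestamp`, `load`, `content`) are assumptions for illustration only.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) reports table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reports ("
        "node TEXT NOT NULL, ts REAL NOT NULL, "
        "load_json TEXT NOT NULL, content_json TEXT NOT NULL)"
    )
    return conn


def store_report(conn: sqlite3.Connection, message: str) -> None:
    """Parse one JSON report from a node and persist it."""
    report = json.loads(message)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO reports VALUES (?, ?, ?, ?)",
            (report["node"], report["timestamp"],
             json.dumps(report["load"]), json.dumps(report["content"])),
        )


def latest_load(conn: sqlite3.Connection, node: str):
    """Return the most recent load status stored for a node, or None."""
    row = conn.execute(
        "SELECT load_json FROM reports WHERE node = ? "
        "ORDER BY ts DESC LIMIT 1", (node,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

A web front end of the kind mentioned above could then read from the same table, for example through a query like the one in `latest_load`.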

S130: The parameter adjustment module analyzes the node's current load status and, when the load status reaches a preset threshold, adjusts the monitoring strategy and notifies the monitored node of the updated monitoring strategy.

The parameter adjustment module analyzes the current load. The load is calculated as a combined measure of memory, CPU, average run-queue length, I/O and network traffic. If the load reaches the preset threshold, the monitoring strategy is adjusted. Adjusting the monitoring strategy of the monitored node includes:
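The text says only that the load is a combined measure of five quantities; it gives no formula. One plausible reading is a weighted sum of normalized metrics, sketched below. The weights, the normalization of every metric to the range 0 to 1, and the threshold value 0.8 are all assumptions, not values from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadSample:
    """One load reading; every field is assumed normalized to 0..1."""
    memory: float
    cpu: float
    run_queue: float  # average run-queue length, scaled by an assumed maximum
    io: float
    network: float


# Hypothetical weights; they sum to 1 so the score also lies in 0..1.
WEIGHTS = {"memory": 0.25, "cpu": 0.30, "run_queue": 0.15,
           "io": 0.15, "network": 0.15}
THRESHOLD = 0.8  # hypothetical preset threshold


def overall_load(sample: LoadSample) -> float:
    """Weighted sum of the five metrics, each clamped to 0..1."""
    return sum(weight * min(max(getattr(sample, name), 0.0), 1.0)
               for name, weight in WEIGHTS.items())


def needs_adjustment(sample: LoadSample, threshold: float = THRESHOLD) -> bool:
    """True when the combined load reaches the preset threshold."""
    return overall_load(sample) >= threshold
```

Other combinations, such as taking the maximum of the metrics, would fit the description equally well; the weighted sum is chosen here only because it is simple to tune.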

Determining the monitoring strategy from changes in network response time, CPU usage or memory usage.

For example, when the overall load of the monitored node rises:

If the network response time increases, the running interval of the monitored node's information collection module is lengthened; if CPU usage rises, the running priority of the monitored node's information collection module is lowered; if memory usage rises, a lightweight monitoring engine is run on the monitored node.

For example, when the overall load of the monitored node remains unchanged or falls:

If the network response time decreases, the running interval of the monitored node's information collection module is shortened until it reaches the default value; if CPU usage falls, the running priority of the monitored node's information collection module is raised until it reaches the default value; if memory usage rises, the monitored node switches back to the default monitoring engine.

In other cases not described above, the existing parameters may be left unchanged.
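The rules above can be sketched as a small strategy selector. The source gives the direction of each adjustment but no magnitudes, so doubling or halving the interval, changing a nice-style priority by one step, the default values and the upper limits are all assumptions. The memory condition in the second branch follows the text as written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringStrategy:
    interval_s: float = 10.0  # hypothetical default collection interval
    nice: int = 0             # nice-style priority: larger means lower priority
    engine: str = "default"   # "default" or "lightweight"


@dataclass(frozen=True)
class LoadChange:
    """Change since the previous sample; positive means the value rose."""
    overall: float
    net_response: float
    cpu: float
    memory: float


DEFAULT = MonitoringStrategy()
MAX_INTERVAL_S = 120.0  # hypothetical upper bound on the interval
MAX_NICE = 19           # hypothetical lowest priority


def adjust(current: MonitoringStrategy, change: LoadChange) -> MonitoringStrategy:
    """Apply the rising-load or steady/falling-load rules once."""
    interval, nice, engine = current.interval_s, current.nice, current.engine
    if change.overall > 0:  # overall load rises
        if change.net_response > 0:
            interval = min(interval * 2, MAX_INTERVAL_S)
        if change.cpu > 0:
            nice = min(nice + 1, MAX_NICE)
        if change.memory > 0:
            engine = "lightweight"
    else:  # overall load unchanged or falling
        if change.net_response < 0:
            interval = max(interval / 2, DEFAULT.interval_s)
        if change.cpu < 0:
            nice = max(nice - 1, DEFAULT.nice)
        if change.memory > 0:  # condition as stated in the source text
            engine = "default"
    # Any case not covered above leaves the parameter unchanged.
    return MonitoringStrategy(interval, nice, engine)
```

Returning a new immutable strategy object, rather than mutating the old one, makes it easy to compare the result with the previous strategy and notify the node only when something actually changed.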

In addition, if the overall load stays above the threshold for an extended period, the alarm device is triggered to raise an alarm, or the monitored node is restarted remotely.
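The source does not define how long "an extended period" is. A minimal sketch, assuming it means a fixed number of consecutive samples above the threshold, could use a sliding window; the class name `OverloadWatch`, the window size and the threshold value are hypothetical.

```python
from collections import deque


class OverloadWatch:
    """Report sustained overload once every sample in the window exceeds the threshold."""

    def __init__(self, threshold: float = 0.8, window: int = 5):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # keeps only the latest `window` flags

    def observe(self, load: float) -> bool:
        """Record one overall-load sample; return True if an alarm is due."""
        self.recent.append(load > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


# Usage: the caller decides what to do when observe() returns True,
# for example notify the alarm device or request a remote restart.
watch = OverloadWatch(threshold=0.8, window=3)
alarms = [watch.observe(x) for x in (0.9, 0.95, 0.7, 0.9, 0.9, 0.9)]
```

A single dip below the threshold resets the condition, which avoids alarming on short bursts.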

The method proposed by the present invention enables web-based cluster monitoring with customizable monitored content, effectively limits the resources consumed by the monitoring program when the system is running under high load, and makes cluster monitoring status and alarm information quick and convenient to obtain.

To achieve the above purpose, as shown in Fig. 2, an embodiment of the present invention further provides a system for monitoring a computer cluster, comprising an information collection module 200, a main monitoring module 100 and a parameter adjustment module 300.

The information collection module 200 collects operating information on the monitored node and sends the node's current load status and the monitored content information to the parameter adjustment module 300 and the main monitoring module 100 respectively.

The main monitoring module 100 receives the node's current load status and the monitored content information, and stores the node's load status and the monitored content information in the database.

The main monitoring module 100 provides a web service for viewing the monitored content information through a web page.

The parameter adjustment module 300 analyzes the node's current load status and, when the load status reaches a preset threshold, adjusts the monitoring strategy of the monitored node and notifies the information collection module 200 of the updated monitoring strategy.

The analysis performed by the parameter adjustment module 300 on the node's current load status includes:

Analyzing one or more of the following parameters of the monitored node:

Memory usage, CPU running state, run-queue length, disk I/O, process groups and network transfer rate.

The adjustment of the monitoring strategy by the parameter adjustment module 300 includes:

Determining the monitoring strategy from changes in network response time, CPU usage or memory usage.

For example, when the overall load of the monitored node rises:

If the network response time increases, the running interval of the monitored node's information collection module is lengthened; if CPU usage rises, the running priority of the monitored node's information collection module is lowered; if memory usage rises, a lightweight monitoring engine is run on the monitored node.

For example, when the overall load of the monitored node remains unchanged or falls:

If the network response time decreases, the running interval of the monitored node's information collection module is shortened until it reaches the default value; if CPU usage falls, the running priority of the monitored node's information collection module is raised until it reaches the default value; if memory usage rises, the monitored node switches back to the default monitoring engine.

In other cases not described above, the existing parameters may be left unchanged.

In addition, if the overall load stays above the threshold for an extended period, the alarm device is triggered to raise an alarm, or the monitored node is restarted remotely.

It should be understood that Fig. 2 groups the units or modules proposed by the present invention together only for ease of description. Clearly, these units or modules may also be implemented as separate modules in a specific computer network system: for example, the information collection module 200 and the parameter adjustment module 300 may be placed on the monitored nodes and the main monitoring module 100 on a monitoring host, and so on.

For example, the overall system structure is as follows:

The information collection program in the information collection module 200 runs on each monitored node. It monitors the cluster to collect the running status of the cluster nodes and the information to be monitored. Each node communicates directly with the main monitoring module 100. The information collection module contains multiple strategies, and the monitored content can be customized through the extension interface provided by the main monitoring module 100. The main monitoring program in the main monitoring module 100 runs on the monitoring host, collects the data from each information collection program and saves it in the database. The parameter adjustment program in the parameter adjustment module 300 adjusts each node's monitoring strategy according to that node's operating load. The alarm device, following the cluster system's preset fault-handling plan, sends email and/or SMS alerts or remotely restarts the monitored node.

For example, the information collection program consists of a main module, a communication module and several functional modules. The main module receives instructions from the parameter adjustment program and configures the functional modules. The functional modules comprise a cluster status and load monitoring module, a lightweight monitoring engine and a default monitoring engine; the default monitoring engine can be given custom monitoring targets by configuring user scripts.
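The division between the two engines can be sketched as follows. The source does not say what the lightweight engine omits, so this sketch assumes it runs only the core status and load checks and skips user-defined checks; the class `Collector`, its method names and the use of Python callables in place of user scripts are hypothetical.

```python
from typing import Callable, Dict

Check = Callable[[], object]  # stands in for a configured user script


class Collector:
    """Main module of a hypothetical information collection program."""

    ENGINES = ("default", "lightweight")

    def __init__(self) -> None:
        self.core_checks: Dict[str, Check] = {}  # status and load monitoring
        self.user_checks: Dict[str, Check] = {}  # custom monitoring targets
        self.engine = "default"

    def register(self, name: str, check: Check, core: bool = False) -> None:
        """Add a core check or a user-defined check."""
        (self.core_checks if core else self.user_checks)[name] = check

    def configure(self, engine: str) -> None:
        """Apply an instruction from the parameter adjustment program."""
        if engine not in self.ENGINES:
            raise ValueError(f"unknown engine: {engine}")
        self.engine = engine

    def run_once(self) -> Dict[str, object]:
        """Run the checks selected by the active engine."""
        checks = dict(self.core_checks)
        if self.engine == "default":
            checks.update(self.user_checks)
        return {name: check() for name, check in checks.items()}
```

Under this assumption, switching engines changes only which checks run, so the communication module can send the result in the same format either way.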

For example, the parameter adjustment program contains a strategy selector that switches the priority, the time interval and the monitoring engine according to the load status.

The apparatus proposed by the present invention enables web-based cluster monitoring with customizable monitored content, effectively limits the resources consumed by the monitoring program when the system is running under high load, and makes cluster monitoring status and alarm information quick and convenient to obtain.

Those of ordinary skill in the art will understand that all or some of the steps of the method embodiments above can be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk or the like.

The above describes only some embodiments of the present invention. It should be noted that those of ordinary skill in the art can make further improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims (12)

1. A method for monitoring a computer cluster, characterized by comprising the following steps:
a monitored node collects operating information and sends the node's current load status and the monitored content information to a parameter adjustment module and a main monitoring module respectively;
the main monitoring module receives the node's current load status and the monitored content information, and stores the node's load status and the monitored content information in a database;
the parameter adjustment module analyzes the node's current load status and, when the load status reaches a preset threshold, adjusts the monitoring strategy of the monitored node and notifies the monitored node of the updated monitoring strategy.

2. The method for monitoring a computer cluster according to claim 1, characterized by further comprising: the main monitoring module provides a web service for viewing the monitored content information through a web page.

3. The method for monitoring a computer cluster according to claim 1, characterized in that the analysis performed by the parameter adjustment module on the node's current load status comprises:
analyzing one or more of the following parameters of the monitored node:
memory usage, CPU running state, run-queue length, disk I/O, process groups and network transfer rate.

4. The method for monitoring a computer cluster according to claim 1, characterized in that adjusting the monitoring strategy of the monitored node comprises:
determining the monitoring strategy from changes in network response time, CPU usage or memory usage.

5. The method for monitoring a computer cluster according to claim 4, characterized in that, when the overall load of the monitored node rises, the monitoring strategy comprises one or more of the following:
if the network response time increases, lengthening the running interval of the monitored node's information collection module;
if CPU usage rises, lowering the running priority of the monitored node's information collection module;
if memory usage rises, running a lightweight monitoring engine on the monitored node.

6. The method for monitoring a computer cluster according to claim 4, characterized in that, when the overall load of the monitored node remains unchanged or falls, the monitoring strategy comprises one or more of the following:
if the network response time decreases, shortening the running interval of the monitored node's information collection module until it reaches the default value;
if CPU usage falls, raising the running priority of the monitored node's information collection module until it reaches the default value;
if memory usage rises, switching back to the default monitoring engine on the monitored node.

7. A system for monitoring a computer cluster, characterized by comprising an information collection module, a main monitoring module and a parameter adjustment module, wherein:
the information collection module is configured to collect operating information on a monitored node and to send the node's current load status and the monitored content information to the parameter adjustment module and the main monitoring module respectively;
the main monitoring module is configured to receive the node's current load status and the monitored content information, and to store the node's load status and the monitored content information in a database;
the parameter adjustment module is configured to analyze the node's current load status and, when the load status reaches a preset threshold, to adjust the monitoring strategy of the monitored node and notify the information collection module of the updated monitoring strategy.

8. The device for monitoring a computer cluster according to claim 7, characterized by further comprising: the main monitoring module provides a web service for viewing the monitored content information through a web page.

9. The device for monitoring a computer cluster according to claim 7, characterized in that the analysis performed by the parameter adjustment module on the node's current load status comprises:
analyzing one or more of the following parameters of the monitored node:
memory usage, CPU running state, run-queue length, disk I/O, process groups and network transfer rate.

10. The device for monitoring a computer cluster according to claim 7, characterized in that the adjustment of the monitoring strategy of the monitored node by the parameter adjustment module comprises:
determining the monitoring strategy from changes in network response time, CPU usage or memory usage.

11. The device for monitoring a computer cluster according to claim 10, characterized in that, when the overall load of the monitored node rises, the monitoring strategy comprises one or more of the following:
if the network response time increases, lengthening the running interval of the monitored node's information collection module;
if CPU usage rises, lowering the running priority of the monitored node's information collection module;
if memory usage rises, running a lightweight monitoring engine on the monitored node.

12. The device for monitoring a computer cluster according to claim 10, characterized in that, when the overall load of the monitored node remains unchanged or falls, the monitoring strategy comprises one or more of the following:
if the network response time decreases, shortening the running interval of the monitored node's information collection module until it reaches the default value;
if CPU usage falls, raising the running priority of the monitored node's information collection module until it reaches the default value;
if memory usage rises, switching back to the default monitoring engine on the monitored node.
CN201110391562XA 2011-11-30 2011-11-30 Method and system for monitoring computer cluster Pending CN102497292A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110391562XA CN102497292A (en) 2011-11-30 2011-11-30 Method and system for monitoring computer cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110391562XA CN102497292A (en) 2011-11-30 2011-11-30 Method and system for monitoring computer cluster

Publications (1)

Publication Number Publication Date
CN102497292A true CN102497292A (en) 2012-06-13

Family

ID=46189080

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110391562XA Pending CN102497292A (en) 2011-11-30 2011-11-30 Method and system for monitoring computer cluster

Country Status (1)

Country Link
CN (1) CN102497292A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030101253A1 (en) * 2001-11-29 2003-05-29 Takayuki Saito Method and system for distributing data in a network
CN101207550A (en) * 2007-03-16 2008-06-25 中国科学技术大学 Load balancing system and method for realizing load balancing of multiple services
CN101442561A (en) * 2008-12-12 2009-05-27 南京邮电大学 Method for monitoring grid based on vector machine support
US7564776B2 (en) * 2004-01-30 2009-07-21 Alcatel-Lucent Usa Inc. Method for controlling the transport capacity for data transmission via a network, and network
CN101499935A (en) * 2008-01-30 2009-08-05 中兴通讯股份有限公司 Alarm processing method for WiMAX base station
CN101505302A (en) * 2009-02-26 2009-08-12 中国联合网络通信集团有限公司 Dynamic regulating method and system for security policy
CN101667034A (en) * 2009-09-21 2010-03-10 北京航空航天大学 Scalable monitoring system supporting hybrid clusters

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103116538B (en) * 2013-01-25 2016-11-30 浪潮电子信息产业股份有限公司 A kind of design for computing power self-regulating system
CN103116538A (en) * 2013-01-25 2013-05-22 浪潮电子信息产业股份有限公司 Design for computer performance self-adjusting system
CN103105923B (en) * 2013-03-07 2015-05-27 鄂尔多斯市云泰互联科技有限公司 Energy-efficient scheduling method and system for information technology (IT) business of cloud computing center
CN103105923A (en) * 2013-03-07 2013-05-15 鄂尔多斯市云泰互联科技有限公司 Energy-efficient scheduling method and system for information technology (IT) business of cloud computing center
CN104104536A (en) * 2013-04-15 2014-10-15 北京中嘉时代科技有限公司 Strategy-based self-adjusting concurrent polling monitoring method and device
CN104104536B (en) * 2013-04-15 2018-08-17 北京中嘉时代科技有限公司 A kind of concurrent poll monitoring method of self-regulation and device based on strategy
CN103268224A (en) * 2013-05-08 2013-08-28 中国科学院微电子研究所 Software running platform based on web access mode
CN103533058A (en) * 2013-10-17 2014-01-22 南京大学镇江高新技术研究院 HDFS (Hadoop distributed file system)/Hadoop storage cluster-oriented resource monitoring system and HDFS/Hadoop storage cluster-oriented resource monitoring method
CN103533058B (en) * 2013-10-17 2017-02-08 南京大学镇江高新技术研究院 HDFS (Hadoop distributed file system)/Hadoop storage cluster-oriented resource monitoring system and HDFS/Hadoop storage cluster-oriented resource monitoring method
CN104898509A (en) * 2015-04-30 2015-09-09 杭州谱谐特科技有限公司 Industrial control computer monitoring method and system based on secure short message
CN104898509B (en) * 2015-04-30 2018-04-27 杭州谱谐特科技有限公司 A kind of industrial personal computer monitoring method and system based on secure short message
CN104954178B (en) * 2015-05-29 2019-02-15 北京奇虎科技有限公司 Method and device for optimizing system alarm
CN104954178A (en) * 2015-05-29 2015-09-30 北京奇虎科技有限公司 Method and device for optimizing system alarm
CN104834584A (en) * 2015-06-04 2015-08-12 深圳市中博科创信息技术有限公司 Method and system for monitoring host computer hardware loads
CN104834584B (en) * 2015-06-04 2017-07-11 深圳市中博科创信息技术有限公司 A kind of method and system for monitoring host hardware load
CN105024880A (en) * 2015-07-17 2015-11-04 哈尔滨工程大学 A Resilient Monitoring Method for Mission-Critical Computer Clusters
CN110222923A (en) * 2015-09-11 2019-09-10 福建师范大学 Dynamically configurable big data analysis system
CN105515838A (en) * 2015-11-26 2016-04-20 青岛海信传媒网络技术有限公司 Service configuration method and HA (High Available) cluster system
CN106802853A (en) * 2017-02-17 2017-06-06 郑州云海信息技术有限公司 A kind of system of selection and device based on many monitor modes
CN106802853B (en) * 2017-02-17 2020-08-21 苏州浪潮智能科技有限公司 Selection method and device based on multiple monitoring modes
CN108449396A (en) * 2018-03-07 2018-08-24 精硕科技(北京)股份有限公司 Distributed Hadoop cluster management methods, main control end and controlled end
CN109614302A (en) * 2018-11-28 2019-04-12 华为技术服务有限公司 Service rate adjustment method and device, and related equipment
CN111405246A (en) * 2020-03-12 2020-07-10 高宽友 Smart city monitoring method and device and management terminal
CN111405246B (en) * 2020-03-12 2021-04-06 厦门宇昊软件有限公司 Smart city monitoring method and device and management terminal
CN114218042A (en) * 2021-12-14 2022-03-22 中国电信股份有限公司 Information processing method, device and system

Similar Documents

Publication Publication Date Title
CN102497292A (en) Method and system for monitoring computer cluster
CN107924359B (en) Management of fault conditions in a computing system
EP3072260B1 (en) Methods, systems, and computer readable media for a network function virtualization information concentrator
CN111124819B (en) Method and device for full link monitoring
US11573878B1 (en) Method and apparatus of establishing customized network monitoring criteria
EP3338191B1 (en) Diagnostic framework in computing systems
US8516295B2 (en) System and method of collecting and reporting exceptions associated with information technology services
CN111131379A (en) Distributed flow acquisition system and edge calculation method
CN103684916A (en) Method and system for intelligent monitoring and analyzing under cloud computing
US12035156B2 (en) Communication method and apparatus for plurality of administrative domains
US8954563B2 (en) Event enrichment using data correlation
US20240202010A1 (en) Aggregating metrics of network elements of a software-defined network for different applications based on different aggregation criteria
US20240339834A1 (en) Techniques for orchestrated load shedding
US10970148B2 (en) Method, device and computer program product for managing input/output stack
Sandur et al. Jarvis: Large-scale server monitoring with adaptive near-data processing
KR20250065317A (en) System and method for managing operation in trust reality viewpointing networking infrastructure
CN118648320A (en) Remote logging management in multi-vendor O-RAN networks
US12155210B2 (en) Techniques for orchestrated load shedding
CN107566187B (en) A SLA violation monitoring method, device and system
CN110377396A (en) A kind of virtual machine Autonomic Migration Framework method, system and electronic equipment
CN103812706A (en) Adaptive method for network interface for isomerous manufacturer data network
Mukherjee et al. AMAS: Adaptive auto-scaling for edge computing applications
Kontoudis et al. A statistical approach to virtual server resource management
US20250062614A1 (en) Techniques for orchestrated load shedding
CN120110912A (en) Cloud-network integrated service system and method based on broadband core network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 2012-06-13