CN107885549A

CN107885549A - Method and system for clearing residual processes in the computing nodes of a TORQUE computing cluster

Info

Publication number: CN107885549A
Application number: CN201711137327.3A
Authority: CN
Inventors: 孙金土
Original assignee: Xinyang Normal University
Current assignee: Xinyang Normal University
Priority date: 2017-11-16
Filing date: 2017-11-16
Publication date: 2018-04-06
Anticipated expiration: 2037-11-16
Also published as: CN107885549B

Abstract

The invention belongs to the field of computer technology and relates to the management of high-performance computing clusters, in particular to a method and system for clearing residual processes in the computing nodes of a TORQUE computing cluster. The method includes: obtaining all user names from a user name list file and storing them in a user list; obtaining the user names that have running tasks on each of the current computing nodes and storing them in a relational array; creating, for each computing node, a file named after the node and storing in it the user names that currently have no task running on that node; deleting the residual processes on each computing node; and automatically updating the user name list file. The system includes a first acquisition module, a second acquisition module, a file creation module, a residual process deletion module, and an update module. The invention can quickly and accurately clear residual processes in the computing nodes of a TORQUE computing cluster.

Description

Method and system for clearing residual processes in TORQUE computing cluster computing nodes

Technical field

The invention belongs to the field of computer technology and relates to the management of high-performance computing clusters, in particular to a method and system for clearing residual processes in the computing nodes of a TORQUE computing cluster.

Background

High-performance computing can be used for numerical analysis and simulation experiments. It is a recognized research method and an important means of scientific innovation. Today, high-performance computing capability reflects a country's comprehensive national strength and has an important influence on national strategy.

There are two ways to increase computing power. One is to increase the computing power of a single machine, for example by raising the CPU clock frequency, adopting many-core technology, or using GPU computing. The other is to run multiple computers in parallel over a high-speed network such as InfiniBand. High-performance computing clusters are usually built by weighing both factors and choosing the most cost-effective design. Computing software and computing results are kept in a file storage system so that data stays synchronized across nodes. When a job runs, the management node assigns the task to computing nodes, and a corresponding computing process is created on each of them. When the computation finishes, the computing process on the node exits, and the management node no longer reports any status for the task.

In practice, a computing node is often found to be running computing processes whose status cannot be found on the management node. For example, the s-th computing node has a computing process running, but the management node considers that node idle. When a new task arrives, the management node keeps submitting tasks to the s-th node, so several computing processes run on it at the same time and computing efficiency drops sharply. Here, a computing process that the management node cannot find is called a residual process. Residual processes have many causes: most often the user program is not written properly; next, the user did not submit the computing task in the correct way; and system bugs are also possible. In short, residual computing processes seriously interfere with normal computing tasks and must be cleared. A fast and accurate method and system for clearing residual processes in computing nodes is urgently needed.

Summary of the invention

The purpose of the present invention is to overcome the above deficiencies by providing a method and system for clearing residual processes in the computing nodes of a TORQUE computing cluster, so as to remove the impact of residual processes on the computing nodes and on the cluster and to ensure that tasks execute quickly.

To achieve this purpose, the present invention adopts the following technical solutions:

A method for clearing residual processes in the computing nodes of a TORQUE computing cluster, comprising the following steps:

Step 1: Obtain all user names from the user name list file and store them in the user list;

Step 2: Obtain the user names that have running tasks on each of the current computing nodes and store them in the relational array;

Step 3: For each computing node, create a file named after the computing node, and store in it the user names that currently have no task running on that node;

Step 4: Delete the residual processes on each computing node;

Step 5: Automatically update the user name list file.

Preferably, before step 1, the method further includes:

Creating a user name list file for the TORQUE computing cluster.

Preferably, after step 4, the method further includes:

Recording the result of deleting residual processes on each computing node.

A system for clearing residual processes in the computing nodes of a TORQUE computing cluster, comprising:

A first acquisition module, configured to obtain all user names from the user name list file and store them in the user list;

A second acquisition module, configured to obtain the user names that have running tasks on each of the current computing nodes and store them in the relational array;

A file creation module, configured to create, for each computing node, a file named after the computing node, and to store in it the user names that currently have no task running on that node;

A residual process deletion module, configured to delete the residual processes on each computing node;

An update module, configured to automatically update the user name list file.

Preferably, the system further includes:

A creation module, configured to create the user name list file for the TORQUE computing cluster.

Preferably, the system further includes:

A recording module, configured to record the result of deleting residual processes on each computing node.

Compared with the prior art, the present invention has the following beneficial effects:

The invention obtains all user names from the user name list file and stores them in the user list; obtains the user names that have running tasks on each of the current computing nodes and stores them in the relational array; creates, for each computing node, a file named after the node that holds the user names currently having no task running on that node; and finally deletes the residual processes on each computing node and automatically updates the user name list file. In this way the residual processes on every computing node of the TORQUE computing cluster are deleted, their impact on computing tasks is eliminated, computing tasks can finish within the expected time, and use of the cluster is standardized.

Brief description of the drawings

FIG. 1 is the first basic flow diagram of the method of the present invention for clearing residual processes in the computing nodes of a TORQUE computing cluster.

FIG. 2 is the second basic flow diagram of the method of the present invention for clearing residual processes in the computing nodes of a TORQUE computing cluster.

FIG. 3 is the first structural diagram of the system of the present invention for clearing residual processes in the computing nodes of a TORQUE computing cluster.

FIG. 4 is the second structural diagram of the system of the present invention for clearing residual processes in the computing nodes of a TORQUE computing cluster.

Detailed description

For ease of understanding, some terms used in the embodiments of the present invention are explained as follows:

TORQUE computing cluster: a common job management system for high-performance computing clusters.

Management node: a node that manages and allocates computing resources; a cluster usually has only one.

Computing node: a node used mainly for computation; a cluster usually has many.

Residual process: a computing process that cannot be found on the management node.

The present invention is further explained below with reference to the drawings and specific embodiments:

Embodiment one:

As shown in FIG. 1, a method of the present invention for clearing residual processes in the computing nodes of a TORQUE computing cluster comprises the following steps:

Step S101: Obtain all user names from the user name list file and store them in the user list;

Step S102: Obtain the user names that have running tasks on each of the current computing nodes and store them in the relational array;

Step S103: For each computing node, create a file named after the computing node, and store in it the user names that currently have no task running on that node;

Step S104: Delete the residual processes on each computing node;

Step S105: Automatically update the user name list file.

Embodiment two:

As shown in FIG. 2, another method of the present invention for clearing residual processes in the computing nodes of a TORQUE computing cluster comprises the following steps:

Step S201: Create a user name list file user.list for the TORQUE computing cluster.

Step S202: Obtain all user names from the user name list file user.list and store them in the user list uli;

For example, suppose the system has 10 users:

uli = [user1, user2, user3, user4, user5, user6, user7, user8, user9, user10];
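Step S202 can be sketched in Python as follows. The source does not specify the layout of user.list, so this sketch assumes one user name per line and ignores blank lines and lines starting with `#`; the function names `parse_user_list` and `load_user_list` are hypothetical.

```python
from pathlib import Path


def parse_user_list(text: str) -> list[str]:
    """Return the user names found in the text of a user-list file.

    Assumed format (not specified in the source): one user name per line;
    blank lines and lines starting with '#' are skipped, and a name that
    appears twice is kept only once.
    """
    users: list[str] = []
    for line in text.splitlines():
        name = line.strip()
        if name and not name.startswith("#") and name not in users:
            users.append(name)
    return users


def load_user_list(path: str = "user.list") -> list[str]:
    """Read the user-list file and return its contents as the list uli."""
    return parse_user_list(Path(path).read_text(encoding="utf-8"))
```

Calling `load_user_list("user.list")` on a file that lists user1 through user10, one per line, would yield the list uli of the example above.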

Step S203: Obtain the user names that have running tasks on each of the current computing nodes and store them in the relational array idic, in the format idic[node name] = [user list];

For example, suppose the TORQUE computing cluster has 16 computing nodes, named node 1, node 2, ..., node 16. Taking node 1, node 2, and node 16 as examples:

idic[node1] = [user1, user3, user8];

idic[node2] = [user1, user4];

idic[node16] = [user5, user6].
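The relational array idic maps naturally onto a Python dictionary. The source does not say how the node and user information is read from TORQUE, so the sketch below assumes that (node name, user name) pairs have already been extracted from the scheduler's job listing; `build_idic` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Iterable


def build_idic(records: Iterable[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (node name, user name) pairs into idic[node] = [users].

    Each pair states that the user has a scheduler-known job on the node.
    Users are kept in first-seen order and listed once per node.
    """
    idic: dict[str, list[str]] = defaultdict(list)
    for node, user in records:
        if user not in idic[node]:
            idic[node].append(user)
    return dict(idic)
```

A node that has no scheduler-known job does not appear as a key, so later steps need the full node list as a separate input.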

Step S204: For each computing node, create a file named after the computing node, and store in it the user names that currently have no task running on that node;

For example:

The file node1 contains: user2, user4, user5, user6, user7, user9, user10;

The file node2 contains: user2, user3, user5, user6, user7, user8, user9, user10;

The file node16 contains: user1, user2, user3, user4, user7, user8, user9, user10.
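The content of each node file is the user list minus the users that have a running task on that node. A minimal sketch, assuming one user name per line in the node file (the source does not fix the file layout) and using the hypothetical helpers `idle_users` and `write_node_files`:

```python
from pathlib import Path


def idle_users(uli: list[str], active: list[str]) -> list[str]:
    """Return the users in uli that have no running task on a node,
    preserving the order of uli."""
    busy = set(active)
    return [user for user in uli if user not in busy]


def write_node_files(uli, idic, nodes, out_dir="."):
    """Write one file per node, named after the node, one idle user per line.

    `nodes` lists every computing node, so a node absent from `idic`
    (no scheduler-known job at all) receives the full user list.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for node in nodes:
        names = idle_users(uli, idic.get(node, []))
        (out / node).write_text("".join(f"{n}\n" for n in names), encoding="utf-8")
```

With the ten users and the idic values of the example, `idle_users(uli, idic["node1"])` gives user2, user4, user5, user6, user7, user9, user10, which matches the node1 file above.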

Step S205: For each computing node, delete the residual processes according to the user names in that node's file, that is, the users that currently have no task running on the node;

For example:

On node1, delete all processes of user2, user4, user5, user6, user7, user9, user10;

On node2, delete all processes of user2, user3, user5, user6, user7, user8, user9, user10;

On node16, delete all processes of user1, user2, user3, user4, user7, user8, user9, user10.
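The source does not state which mechanism deletes the processes. One possible sketch, assuming passwordless ssh from the management node to each computing node and the procps `pkill` tool on the nodes, is shown below; `kill_command` and `clear_node` are hypothetical names, and the default is a dry run that only returns the commands.

```python
import subprocess


def kill_command(node: str, user: str) -> list[str]:
    """Build the argv that would kill all of `user`'s processes on `node`.

    Assumption: passwordless ssh to the node and the procps `pkill` tool.
    """
    return ["ssh", node, "pkill", "-KILL", "-u", user]


def clear_node(node: str, idle: list[str], dry_run: bool = True) -> list[list[str]]:
    """Kill the processes of every idle user on one node.

    With dry_run=True (the default) the commands are only returned,
    not executed, so the plan can be inspected first.
    """
    commands = [kill_command(node, user) for user in idle]
    if not dry_run:
        for argv in commands:
            # pkill exits with status 1 when no process matched,
            # which is the normal case here, so check=False is used.
            subprocess.run(argv, check=False)
    return commands
```

Because every process of a listed user is killed on nodes where that user has no task, this approach presupposes that user.list contains only ordinary cluster accounts and no system accounts.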

Step S206: Record the result of deleting residual processes on each computing node by writing it to now.list, until all residual processes have been deleted.
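The record format of now.list is not given in the source. The sketch below assumes one tab-separated line per node and user, with a timestamp and a free-form status string; both the layout and the helper names `format_record` and `append_records` are illustrative assumptions.

```python
from datetime import datetime


def format_record(node: str, user: str, status: str, when: datetime) -> str:
    """Format one log line: timestamp, node, user, status (tab-separated)."""
    return f"{when.isoformat(timespec='seconds')}\t{node}\t{user}\t{status}"


def append_records(lines, path="now.list"):
    """Append the given log lines to the record file, one per line."""
    with open(path, "a", encoding="utf-8") as fh:
        for line in lines:
            fh.write(line + "\n")
```

Appending rather than overwriting keeps the results of earlier cleanup runs available for later inspection.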

Step S207: If the administrator has added a new user and that user is currently executing tasks, the user name list file user.list is updated automatically.
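One way to read step S207 is that any user found running tasks but missing from user.list is appended to it. Under that reading, which is an interpretation rather than something the source spells out, the update can be sketched with the hypothetical helper `updated_user_list`:

```python
def updated_user_list(uli, idic):
    """Return uli extended with every user that runs a task on some node
    but is not yet listed, keeping the existing order."""
    merged = list(uli)
    for users in idic.values():
        for user in users:
            if user not in merged:
                merged.append(user)
    return merged
```

A user missing from user.list never appears in any node file, so residual processes of that user would never be cleared; keeping the list current closes this gap for later runs.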

Embodiment three:

As shown in FIG. 3, a system of the present invention for clearing residual processes in the computing nodes of a TORQUE computing cluster comprises:

A first acquisition module 301, configured to obtain all user names from the user name list file and store them in the user list;

A second acquisition module 302, configured to obtain the user names that have running tasks on each of the current computing nodes and store them in the relational array;

A file creation module 303, configured to create, for each computing node, a file named after the computing node, and to store in it the user names that currently have no task running on that node;

A residual process deletion module 304, configured to delete the residual processes on each computing node;

An update module 305, configured to automatically update the user name list file.

Embodiment four:

As shown in FIG. 4, another system of the present invention for clearing residual processes in the computing nodes of a TORQUE computing cluster comprises:

A creation module 401, configured to create the user name list file for the TORQUE computing cluster;

A first acquisition module 402, configured to obtain all user names from the user name list file and store them in the user list;

A second acquisition module 403, configured to obtain the user names that have running tasks on each of the current computing nodes and store them in the relational array;

A file creation module 404, configured to create, for each computing node, a file named after the computing node, and to store in it the user names that currently have no task running on that node;

A residual process deletion module 405, configured to delete the residual processes on each computing node;

A recording module 406, configured to record the result of deleting residual processes on each computing node;

An update module 407, configured to automatically update the user name list file.
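The modules of this embodiment can be wired together as in the sketch below. It is only an illustration under assumptions: each module is represented by a caller-supplied function, the class name `CleanupSystem` and its field names are hypothetical, and the per-node file of module 404 is reduced to an in-memory dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CleanupSystem:
    """Runs the modules of Embodiment four in order (401 to 407)."""
    create_user_file: Callable[[], None]                  # module 401
    load_users: Callable[[], list[str]]                   # module 402
    load_active: Callable[[], dict[str, list[str]]]       # module 403
    kill: Callable[[str, str], str]                       # module 405, returns a status
    update_user_file: Callable[[list[str], dict], None]   # module 407
    log: list[tuple[str, str, str]] = field(default_factory=list)  # module 406

    def run(self, nodes: list[str]) -> dict[str, list[str]]:
        self.create_user_file()
        uli = self.load_users()
        idic = self.load_active()
        # Module 404: per-node list of users with no scheduler-known task.
        plan = {n: [u for u in uli if u not in idic.get(n, [])] for n in nodes}
        for node, idle in plan.items():
            for user in idle:
                # Module 405 kills; module 406 records the outcome.
                self.log.append((node, user, self.kill(node, user)))
        self.update_user_file(uli, idic)
        return plan
```

Passing the modules in as functions keeps the sketch independent of how a given cluster queries TORQUE or reaches its nodes, and lets the sequence be exercised with stand-in functions before anything is actually killed.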

The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and refinements without departing from the principle of the invention, and these improvements and refinements shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A method for clearing residual processes in the computing nodes of a TORQUE computing cluster, characterized by comprising the following steps:

Step 1: Obtain all user names from the user name list file and store them in the user list;

Step 2: Obtain the user names that have running tasks on each of the current computing nodes and store them in the relational array;

Step 3: For each computing node, create a file named after the computing node, and store in it the user names that currently have no task running on that node;

Step 4: Delete the residual processes on each computing node;

Step 5: Automatically update the user name list file.

2. The method for clearing residual processes in the computing nodes of a TORQUE computing cluster according to claim 1, characterized in that, before step 1, the method further includes:

Creating a user name list file for the TORQUE computing cluster.

3. The method for clearing residual processes in the computing nodes of a TORQUE computing cluster according to claim 1, characterized in that, after step 4, the method further includes:

Recording the result of deleting residual processes on each computing node.

4. A system for clearing residual processes in the computing nodes of a TORQUE computing cluster, based on the method according to any one of claims 1 to 3, characterized by comprising:

A first acquisition module, configured to obtain all user names from the user name list file and store them in the user list;

A second acquisition module, configured to obtain the user names that have running tasks on each of the current computing nodes and store them in the relational array;

A file creation module, configured to create, for each computing node, a file named after the computing node, and to store in it the user names that currently have no task running on that node;

A residual process deletion module, configured to delete the residual processes on each computing node;

An update module, configured to automatically update the user name list file.

5. The system for clearing residual processes in the computing nodes of a TORQUE computing cluster according to claim 4, characterized by further comprising:

A creation module, configured to create the user name list file for the TORQUE computing cluster.

6. The system for clearing residual processes in the computing nodes of a TORQUE computing cluster according to claim 4, characterized by further comprising:

A recording module, configured to record the result of deleting residual processes on each computing node.