
CN107506261B - Cascade fault-tolerant processing method suitable for CPU and GPU heterogeneous clusters - Google Patents


Info

Publication number
CN107506261B
CN107506261B
Authority
CN
China
Prior art keywords
service
data
information
backup
consistency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710647664.0A
Other languages
Chinese (zh)
Other versions
CN107506261A (en)
Inventor
姜海
王忠儒
李海磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Digapis Technology Co ltd
Original Assignee
Beijing Digapis Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Digapis Technology Co ltd filed Critical Beijing Digapis Technology Co ltd
Priority to CN201710647664.0A priority Critical patent/CN107506261B/en
Publication of CN107506261A publication Critical patent/CN107506261A/en
Application granted granted Critical
Publication of CN107506261B publication Critical patent/CN107506261B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1458Management of the backup or restore process
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • G06F11/1443Transmit or communication errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/546Message passing systems or structures, e.g. queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Hardware Redundancy (AREA)
  • Retry When Errors Occur (AREA)

Abstract

The disclosure relates to a cascade fault-tolerant processing method suitable for CPU and GPU heterogeneous clusters. The method comprises the following steps: constructing a data transmission consistency detection model for detecting the consistency of data transmission; constructing a data access consistency model for achieving consistency of data access between the CPU and the GPU; constructing a data operation result correctness detection model for detecting the correctness of thread data operation results; constructing a service backup model at the application layer for backing up the history of the service run; and constructing a service job information backup model at the system layer for backing up the job information of the running service. Therefore, when a non-physical-damage fault occurs in the GPU and CPU heterogeneous cluster, the location of the service fault can be quickly identified for business personnel, the state before the fault can be extracted, and the service can be quickly restored, reducing losses.

Description

Cascade fault-tolerant processing method suitable for CPU and GPU heterogeneous clusters
Technical Field
The disclosure relates to the technical field of computers, in particular to a cascade fault-tolerant processing method suitable for CPU and GPU heterogeneous clusters.
Background
Since the first computer was born, computer technology has developed rapidly, taking forms ranging from the Personal Computer to the Super Computer. The human demand for computation is endless, and to satisfy that demand computers keep getting faster. In 2016, the "Sunway TaihuLight" supercomputer developed in China led a world supercomputer ranking in which Chinese machines held the first and second places for the first time, and its computing capability reached the 100 PFLOPS scale for the first time, reflecting the nation's overall strength. At the same time, however, such computing power is accompanied by an extremely complex and huge computer architecture; the main computing power of "Tianhe-2", for example, is provided by GPUs (Graphics Processing Units), and its processor count reaches the scale of tens of thousands.
The growing number of processors and the increasing complexity of the architecture raise the probability of system failure. At the same time, applications running on high-performance computing equipment have very long service times; for example, simulations and cryptographic solving problems run for days or even years. The fault-tolerance problem of high-performance computing equipment is therefore all the more prominent, and designing an efficient and reliable fault-tolerance mechanism has become an urgent problem in system development.
The GPU and CPU heterogeneous cluster system occupies an important position in current high-performance computing architectures; in the supercomputer Top500 ranking, more than 40% of the machines adopt a GPU and CPU architecture, so research on fault-tolerance methods for this representative architecture has strong practical significance.
A GPU and CPU heterogeneous cluster is characterized by computing resources comprising two kinds of processors, CPUs and GPUs, which form a multi-layer storage hierarchy in hardware, including CPU-level cache, memory and shared storage, as well as GPU-level shared memory, global memory and the like. FIG. 1 is a schematic diagram of the data storage hierarchy in a GPU and CPU heterogeneous cluster. As shown in fig. 1, the CPU and GPU heterogeneous cluster system has multiple storage layers with different data transmission modes between layers, and the reliability of data transmission is a guarantee of the correct operation of the service.
The parallel modes of the CPU and GPU heterogeneous cluster system include CPU-level inter-process parallelism, GPU-level inter-block parallelism and intra-block thread parallelism, so the consistency of data among all levels must be guaranteed. Large-scale data analysis and processing applications have high computational complexity, use large-scale computing resources, run for a long time and place high demands on the reliability of those resources. Some application services have large data access volumes and need large-scale read-write storage devices; some are communication-intensive and demand much of the system's network transmission; some occupy large amounts of GPU memory with frequent read-write operations and demand high GPU stability; some have complex flows with numerous intermediate links, where a single-step error affects the whole service flow, so flow fault tolerance is essential. A well-designed fault-tolerant mechanism is the guarantee of reliable operation of application computation.
Disclosure of Invention
In order to overcome the problems in the related art, the invention provides a cascade fault-tolerant processing method suitable for CPU and GPU heterogeneous clusters.
According to a first aspect of the embodiments of the present disclosure, a cascaded fault-tolerant processing method adapted to CPU and GPU heterogeneous clusters is provided, including:
constructing a data transmission consistency detection model for detecting the consistency of data transmission;
constructing a data access consistency model for realizing the consistency of data access between the CPU and the GPU;
constructing a data operation result correctness detection model for detecting the correctness of the thread data operation result;
constructing a service backup model in an application layer for backing up the history of service operation;
and constructing a service operation information backup model at a system layer for backing up operation information of service operation.
For the above method, in a possible implementation manner, the constructing a data transmission consistency detection model includes:
detecting the consistency of data information before transmission and data information after transmission;
detecting the consistency of the check values of the data before transmission and the data after transmission, wherein the check value of the data before transmission and the check value of the data after transmission are determined through the same hash function;
and determining that the data transmission has consistency under the condition that the information of the data before transmission is consistent with that of the data after transmission and the check values of the data before transmission and the data after transmission are consistent.
For the above method, in a possible implementation manner, the building a data access consistency model includes:
mapping a computing task into a plurality of concurrently executable threads, wherein the plurality of concurrently executable threads form respective thread blocks;
the consistency of thread data in the thread blocks is realized by combining a shared memory technology, a thread synchronization fence technology and an atomic operation technology.
For the above method, in a possible implementation manner, the constructing a data operation result correctness detection model includes:
distributing the same operation task to three operation modules with the same scale and different execution positions for respective operation, and respectively obtaining three operation results;
determining the operation result to be a correct operation result under the condition that the three operation results are all equal;
under the condition that exactly two of the three operation results are equal, determining the equal operation results to be the correct operation result;
and under the condition that the three operation results are not equal, re-executing the operation task.
For the above method, in a possible implementation manner, the building a service backup model at an application layer includes:
periodically backing up service information, detecting the running state of the service at a fixed time interval on a main flow of application, and backing up a history file of the service running in the time interval;
and aperiodic backup service information, in the branch flow of the application, checking the service operation state by taking the time required by each service as an interval, and backing up the history file of the service operation in the time interval.
For the above method, in a possible implementation manner, the building a service job information backup model at a system layer includes:
and simultaneously recording and redundantly backing up the operation information of the service operation by using the database and the file system.
For the above method, in a possible implementation manner, the recording and redundant backup of the job information of the running service by the database and the file system simultaneously includes: in the case of a job exception caused by a normal maintenance power-off, before the system is shut down for maintenance, manually retaining the service job state database information, the periodic backup service information and the aperiodic backup service information.
For the above method, in a possible implementation manner, the recording and redundant backup of the job information of the service operation by using the database and the file system at the same time includes: and in the case of job abnormity caused by unexpected power failure, keeping periodic backup service information, aperiodic backup service information, job state database information and job state log of the time point of unexpected power failure.
For the above method, in a possible implementation manner, the recording and redundant backup of the job information of the service operation by using the database and the file system at the same time includes: and under the condition that the data error causes operation abnormity, suspending service operation, reserving a history file of service operation before suspension, correcting the data with the error, and recovering the service after correction.
For the above method, in a possible implementation manner, the recording and redundant backup of the job information of the service operation by using the database and the file system at the same time includes: and under the condition that the operation is abnormal due to misoperation of personnel, the misoperation is stopped, and the service is continuously operated from the backup time point of the current periodic backup service information of the system.
The technical scheme provided by the embodiments of the disclosure can have the following beneficial effects: a data transmission consistency detection model is constructed for detecting the consistency of data transmission, a data access consistency model is constructed for achieving consistency of data access between the CPU and the GPU, a data operation result correctness detection model is constructed for detecting the correctness of thread data operation results, a service backup model is constructed at the application layer for backing up the history of the service run, and a service job information backup model is constructed at the system layer for backing up the job information of the running service. Therefore, when a non-physical-damage fault occurs in the GPU and CPU heterogeneous cluster, the location of the service fault can be quickly identified for business personnel, the state before the fault can be extracted, and the service can be quickly restored, reducing losses.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic diagram of a data storage hierarchy in a GPU and CPU heterogeneous cluster.
Fig. 2 is a schematic diagram illustrating consistency check of data transmission in a cascading fault-tolerant processing method adapted to CPU and GPU heterogeneous clusters according to an exemplary embodiment.
Fig. 3 is a schematic diagram illustrating a processing flow of job exception caused by normal shutdown, overhaul and power failure in a cascading fault-tolerant processing method adapted to CPU and GPU heterogeneous clusters according to an exemplary embodiment.
Fig. 4 is a schematic flow chart illustrating a processing procedure of job exception caused by unexpected power failure in a cascading fault-tolerant processing method adapted to CPU and GPU heterogeneous clusters according to an exemplary embodiment.
Fig. 5 is a schematic diagram illustrating a cascaded fault-tolerant processing method adapted to CPU and GPU heterogeneous clusters according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The present invention will be described in detail below with reference to specific examples. The following examples will help those skilled in the art to further understand the present invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications, obvious to those skilled in the art, can be made without departing from the spirit of the invention, all of which fall within the scope of the present invention.
First step: constructing a data transmission consistency detection model for detecting the consistency of data transmission.
As an example of this embodiment, constructing a data transmission consistency detection model includes: the consistency of data information before transmission and data information after transmission is detected, and the consistency of check values of the data before transmission and the data after transmission is detected, wherein the check values of the data before transmission and the check values of the data after transmission are determined through the same Hash Function (Hash Function), and the consistency of data transmission is determined under the condition that the data information before transmission and the data information after transmission are consistent and the check values of the data before transmission and the data after transmission are consistent.
Fig. 2 is a schematic diagram illustrating the consistency check of data transmission in a cascade fault-tolerant processing method adapted to CPU and GPU heterogeneous clusters according to an exemplary embodiment. As shown in fig. 2, two comparisons may be made: a comparison of the data information of the data itself and a comparison of the check values obtained by hashing the data. If both comparisons are equal, it can be determined that the data before and after transmission are consistent; if either comparison is not equal, it can be determined that the data before and after transmission are inconsistent, and the data is retransmitted. For example, for data0 (before transmission) and data1 (after transmission), the data information is determined using the LEN function and the check value is determined using the same hash function. If LEN(data0) = LEN(data1) and hash(data0) = hash(data1), the data before and after transmission can be determined to be consistent; if LEN(data0) ≠ LEN(data1) or hash(data0) ≠ hash(data1), the data before and after transmission are inconsistent and the data is retransmitted.
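As a minimal illustrative sketch (not part of the original disclosure), the length-plus-hash comparison of fig. 2 might be realized in CUDA C++ host code roughly as follows; the FNV-1a hash and the round-trip verification strategy are assumptions chosen for illustration, since the disclosure does not prescribe a particular hash function.

```
#include <cstdint>
#include <cstring>
#include <cuda_runtime.h>

// FNV-1a hash used as the common check value on both sides of the transfer.
static uint64_t fnv1a(const uint8_t* buf, size_t len) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; ++i) {
        h ^= buf[i];
        h *= 1099511628211ULL;
    }
    return h;
}

// Returns true when the host -> device -> host round trip preserved both the
// data length and the check value (consistency in the sense of fig. 2);
// on false the caller is expected to retransmit the data.
bool transfer_is_consistent(const uint8_t* data0, size_t len0) {
    uint8_t* d_buf = nullptr;
    if (cudaMalloc(reinterpret_cast<void**>(&d_buf), len0) != cudaSuccess) return false;
    cudaMemcpy(d_buf, data0, len0, cudaMemcpyHostToDevice);

    // data1 plays the role of the data after transmission.
    uint8_t* data1 = new uint8_t[len0];
    cudaMemcpy(data1, d_buf, len0, cudaMemcpyDeviceToHost);
    size_t len1 = len0;  // in a real transfer the receiver would report its own length

    bool ok = (len0 == len1) && (fnv1a(data0, len0) == fnv1a(data1, len1));

    delete[] data1;
    cudaFree(d_buf);
    return ok;
}
```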
Second step: constructing a data access consistency model for achieving the consistency of data access between the CPU and the GPU.
As an example of this embodiment, constructing the data access consistency model includes:
the computing task is mapped into a plurality of concurrently executable threads forming respective thread blocks (blocks). The thread blocks cannot communicate with each other, and threads in the same thread block can communicate with each other.
Thread communication in the thread blocks is realized by combining a shared memory technology, a thread synchronization fence technology and an atomic operation technology.
In this example, shared memory is read-write memory that can be accessed by all threads in the same thread block, and its lifetime is the lifetime of the thread block. This embodiment uses the shared memory technique to store data shared by the multiple threads of a thread block.
In this example, the thread synchronization fence is a multi-thread synchronization technique. Threads that need to wait for other threads run up to the fence; once all threads have arrived at the fence, the fence is released, achieving multi-thread coordination and synchronization. This embodiment synchronizes the multiple threads within a thread block through the thread synchronization fence technique.
In this example, an atomic operation is an operation that cannot be interrupted by the thread scheduling mechanism; once started, it runs to completion without switching to another thread in the middle. This embodiment uses the atomic operation technique to keep thread execution within a thread block, and among thread blocks, independent.
It should be noted that, those skilled in the art may select an appropriate combination manner as needed to combine the shared memory technology, the thread synchronization fence technology, and the atomic operation technology to achieve consistency of thread data within the thread block, which is not limited herein.
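For illustration only, the following CUDA kernel sketch combines the three techniques named above: shared memory for intra-block data, __syncthreads() as the synchronization fence, and atomicAdd() as the atomic operation. The block-level sum reduction is an assumed example workload, not the method of the disclosure.

```
#include <cuda_runtime.h>

// Illustrative kernel: each block reduces its slice of `in` into one partial sum.
// Shared memory holds per-thread values, __syncthreads() acts as the
// synchronization fence, and atomicAdd() publishes the block result atomically.
// Assumes blockDim.x is a power of two.
__global__ void block_sum(const float* in, float* out, int n) {
    extern __shared__ float cache[];          // shared memory, lives as long as the block
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    cache[tid] = (gid < n) ? in[gid] : 0.0f;  // each thread writes its own slot
    __syncthreads();                          // fence: wait until every thread has written

    // Tree reduction inside the block; the fence after every step keeps the
    // shared data consistent before the next step reads it.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    if (tid == 0) atomicAdd(out, cache[0]);   // uninterruptible update of the global sum
}

// Launch sketch: block_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);
```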
Third step: constructing a data operation result correctness detection model for detecting the correctness of thread data operation results.
A voting mechanism may be employed to determine the correctness of a calculation. The same operation task is distributed to three operation modules of the same scale at different execution positions, each producing its own result. If all three results are equal, the result is determined to be the correct operation result; if exactly two of the three results are equal, the equal result is determined to be the correct operation result; if none of the three results are equal, the operation task is re-executed.
In general, the probability of a system logic error increases with the number of processor threads; for example, the system computes 1+1 and obtains a result ≠ 2. For services that require absolute data correctness (e.g., cryptographic solving), system logic errors can produce devastating results.
For example, for an important operation in a service, the same task is computed by module 1, module 2 and module 3, which have the same scale and different execution positions, yielding three results job0, job1 and job2. If job0 = job1 = job2, the result is considered correct; if job0 = job1 ≠ job2, job0 and job1 are considered correct, job0 and job1 are retained and job2 is discarded; if job0 ≠ job1 ≠ job2, the calculation of this link is performed again. In this way the correctness of the data result is checked.
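A minimal sketch of such a 2-out-of-3 vote is given below; the function names and the long result type are illustrative assumptions, and how the three modules are dispatched (different GPUs, blocks or nodes) is left to the application.

```
// 2-out-of-3 vote over the results job0, job1 and job2 produced by three
// equally sized modules running at different execution positions.
// Returns true and writes the agreed value when at least two results match;
// returns false when all three differ, signalling that the task must be rerun.
bool vote_result(long job0, long job1, long job2, long* agreed) {
    if (job0 == job1) { *agreed = job0; return true; }  // also covers job0 == job1 == job2
    if (job0 == job2) { *agreed = job0; return true; }
    if (job1 == job2) { *agreed = job1; return true; }
    return false;                                       // no majority: re-execute the task
}

// Usage sketch (run_on_module stands in for the application's own dispatch):
//   long r0 = run_on_module(0, task), r1 = run_on_module(1, task), r2 = run_on_module(2, task);
//   long r;
//   if (!vote_result(r0, r1, r2, &r)) { /* re-execute the operation task */ }
```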
Fourth step: constructing a service backup model at the application layer for backing up the history of the service run.
Constructing the service backup model at the application layer comprises the following steps: periodic backup of service information and aperiodic backup of service information.
For periodic backup of service information, the running state of the service is detected at fixed time intervals in the main flow of the application, and the history files of the service run are backed up. For example, if the fixed time interval is 2 hours, the service running state is detected every 2 hours and a history file of the service run is backed up; all history files generated by the service run may be saved, or only the history files generated within those 2 hours.
For aperiodic backup of service information, on the branch flows of the application, the service running state is checked at intervals equal to the time required by each service, and the history files of the service run within that interval are backed up.
For example, if the time required by the first service is 1 hour, all history files of the first service operation may be backed up after the first service operation is finished after 1 hour. For another example, if the time required by the second service is 3 hours, all the history files of the second service operation may be backed up after the second service operation is finished after 3 hours.
The periodic backup service information and the aperiodic backup service information complement each other, so that the latest correct state of each process (namely the state of the process and its sub-processes retained before the fault occurred) can be quickly found after a fault occurs, reducing system losses.
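The two backup modes might be sketched on the host side as follows; the file names, the snapshot contents and the placeholder main-flow loop are assumptions for illustration only.

```
#include <chrono>
#include <cstdio>
#include <string>

// Write one history file for the current service state; the path and the
// contents are placeholders for the application's own history records.
static void backup_history(const std::string& tag) {
    std::string path = "backup_" + tag + ".log";
    if (FILE* f = std::fopen(path.c_str(), "w")) {
        std::fprintf(f, "service history snapshot: %s\n", tag.c_str());
        std::fclose(f);
    }
}

// Periodic backup on the main flow: check the running state every `interval`
// and back up the history produced since the previous checkpoint.
void run_main_flow_with_periodic_backup(std::chrono::seconds interval) {
    auto last = std::chrono::steady_clock::now();
    bool running = true;
    while (running) {
        // ... one slice of the main service flow ...
        auto now = std::chrono::steady_clock::now();
        if (now - last >= interval) {
            backup_history("periodic");
            last = now;
        }
        running = false;  // placeholder so the sketch terminates
    }
}

// Aperiodic backup on a branch flow: each sub-service backs up its own
// history once, right after it finishes, however long it took.
void run_branch_service_with_aperiodic_backup(const std::string& name) {
    // ... run the sub-service to completion ...
    backup_history("aperiodic_" + name);
}
```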
Fifth step: constructing a service job information backup model at the system layer for backing up the job information of the running service.
The method for constructing the service operation information backup model at the system layer comprises the following steps: and simultaneously recording and redundantly backing up the operation information of the service operation by using the database and the file system.
As an example of this embodiment, recording and redundantly backing up the job information of the running service with the database and the file system simultaneously includes monitoring system resources.
High availability of computing resources is a guarantee of correct computation. GPU resource detection is realized by measuring indexes such as GPU memory access performance and CPU-GPU transmission performance. Health checks are performed on candidate GPU resources and computing resources in good health are selected, ensuring quality at the source of the computation.
In this embodiment, a method combining a database and a file system is used to monitor the resource state and the job state and to redundantly back up job information, including monitoring and backing up the states of resources such as GPUs and CPUs, the state of the running job process, and the information of historical jobs, ensuring that the latest correct state before a fault can be provided when a service fault occurs.
As another example of this embodiment, recording and redundantly backing up the job information of the running service with the database and the file system simultaneously includes monitoring the job state.
Accurately obtaining the job running state is a key link of job monitoring. For an application with a complex flow, correct execution of the previous step is the premise for submitting the next step, and the result of the previous step determines the direction of the next operation. The system obtains accurate job state information by monitoring and verifying the application's job information, which includes the job number, application name, type and number of computing resources, application input file, application output file, current job state and the like.
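Purely as an illustration, the monitored job information could be gathered into a record such as the following; the field names and the validity check are assumptions, not the system's actual schema.

```
#include <string>

// Job information monitored at the system layer (field names are illustrative).
struct JobInfo {
    long        job_id;           // job number
    std::string app_name;         // application name
    std::string resource_type;    // e.g. "CPU" or "GPU"
    int         resource_count;   // number of computing resources
    std::string input_file;       // application input file
    std::string output_file;      // application output file
    std::string state;            // current job state, e.g. "RUNNING", "DONE", "FAILED"
};

// Minimal check used before the next step of a complex flow is submitted:
// the previous step must have completed and produced its output file.
bool previous_step_ok(const JobInfo& job) {
    return job.state == "DONE" && !job.output_file.empty();
}
```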
As another example of this embodiment, recording and redundantly backing up the job information of the running service with the database and the file system simultaneously includes handling job exceptions.
How to handle job exceptions is an important link in the automatic operation of the system. Job exceptions have two main causes: the data does not match the application, or the computing resources have failed. A mismatch between data and application can only be resolved manually; for a computing resource failure, the system can detect the affected resources, remove the problem nodes, migrate the jobs and resume operation. When a fault such as a hardware crash occurs, job exception handling can automatically recover the job from the job information saved before the fault.
A job exception may be caused by the following: a normal maintenance shutdown, an unexpected power failure, a data error, or personnel misoperation.
In one possible implementation, the processing flow for a job exception caused by a normal maintenance shutdown includes: while the service is running normally and before the system is shut down for maintenance, manually retaining the parameter information of the running service, mainly comprising the service job state database information, the periodic backup service information and the aperiodic backup service information. Fig. 3 is a schematic diagram illustrating the processing flow for a job exception caused by a normal maintenance shutdown in a cascade fault-tolerant processing method adapted to CPU and GPU heterogeneous clusters according to an exemplary embodiment. As shown in fig. 3, after the system is restarted and the system status check is completed, the parameter information saved before shutdown is read in and the service continues to run.
In one possible implementation, the processing flow for a job exception caused by an unexpected power failure includes: retaining the periodic backup service information and the aperiodic backup service information from the time of the unexpected power failure. Fig. 4 is a schematic flow chart illustrating the processing of a job exception caused by an unexpected power failure in a cascade fault-tolerant processing method adapted to CPU and GPU heterogeneous clusters according to an exemplary embodiment. As shown in fig. 4, when power is lost unexpectedly, the periodic backup service information, the aperiodic backup service information, the job state database information and the job state log from the time of the failure are retained. When the system restarts and performs status detection, job information detection of historical unfinished jobs is performed at the same time; when the parameters are read, the periodic backup service information, the aperiodic backup service information and the job state database information saved by the system are read in as parameters, and the service is then restored. If the job state database information is incomplete, the system automatically restores it from the job state log information.
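A simplified sketch of this recovery order (database first, job state log as fallback, then resume from the latest backups) is shown below; the file names and helper functions are placeholders assumed for illustration.

```
#include <fstream>
#include <string>

// Placeholder persistence helpers; the real system would query its job state
// database and job state log instead of reading plain files.
static bool load_job_db(const std::string& path, std::string* state) {
    std::ifstream in(path);
    return static_cast<bool>(std::getline(in, *state));   // missing/empty => incomplete
}
static bool rebuild_db_from_log(const std::string& log_path, std::string* state) {
    std::ifstream in(log_path);
    return static_cast<bool>(std::getline(in, *state));
}

// Recovery after an unexpected power failure, following fig. 4: read the job
// state database kept at the failure point; if it is incomplete, rebuild it
// from the job state log; then resume from the newest periodic/aperiodic backup.
void recover_after_power_failure() {
    std::string db_state;
    if (!load_job_db("job_state.db", &db_state)) {
        rebuild_db_from_log("job_state.log", &db_state);   // log is the fallback source
    }
    // ... read the latest periodic and aperiodic backup files as parameters
    //     and restart the service from them ...
}
```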
In one possible implementation, the processing flow of the job exception caused by the data error includes:
when an error occurs in data transmission, the service is suspended, various states before the service is suspended are reserved, the data with the error is corrected, and the service is recovered after the correction.
When an error occurs in data reading, the service is suspended, various states before the service is suspended are reserved, the data with the error is corrected, and the service is recovered after the correction.
When the calculation error occurs, the service is suspended, various states before the service is suspended are reserved, error data are recalculated, and the service is recovered under the condition that the calculation result of the data is correct.
In one possible implementation manner, the processing flow of the job exception caused by the misoperation of the personnel comprises the following steps:
The misoperation is stopped, and the service continues to run from the point where it was suspended.
In another possible implementation manner, the processing flow of the job exception caused by the misoperation of the personnel includes:
and (4) stopping the misoperation, and continuing the operation of the service from the backup point of the last periodic backup service information.
As an example of this embodiment, the method of combining the database and the file system includes redundant backup of the job information.
The core of the system is the job. Job information includes the job number, application name, type and number of computing resources, application input file, application output file, current job state and the like. This information is crucial to the management and control of the whole job flow. The system uses a database and a file system to double-back up the job information, with the database as the primary store and the file system as the secondary store. Redundant backup of the job information effectively guarantees its safety.
In one possible implementation, the job information redundancy backup includes:
submitting the job into the job queue;
creating the job-related information in the database application information table and the job object table;
creating a job information backup file in the file system;
while the job runs, periodically updating the database job object table;
while the job runs, updating the job information backup file at the backup time interval;
when the job object table fails, restoring the job state through the application information table and the job information backup file;
and when the database application information table and the job object table fail simultaneously, restoring through the job information backup file.
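As an illustrative sketch of the dual backup and the restore priority listed above, both stores are modelled as plain files because the disclosure names no particular database product; all names are assumptions.

```
#include <fstream>
#include <string>

// One record per job, serialised as "id|app|state"; the same record goes to
// the primary store (the "database", modelled here as a file) and to the
// secondary store (the job information backup file).
struct JobRecord { long id; std::string app; std::string state; };

static std::string serialise(const JobRecord& r) {
    return std::to_string(r.id) + "|" + r.app + "|" + r.state + "\n";
}

// Steps 4 and 5 of the list above: on every state change, update the job
// object table and the job information backup file.
void update_job_stores(const JobRecord& r) {
    std::ofstream db("job_object_table.db", std::ios::app);   // primary store
    db << serialise(r);
    std::ofstream fs("job_info_backup.txt", std::ios::app);   // secondary store
    fs << serialise(r);
}

// Steps 6 and 7: restore the latest job state, preferring the database and
// falling back to the backup file when the database tables have failed.
bool restore_latest_job_state(std::string* last_record) {
    std::string line, last;
    std::ifstream db("job_object_table.db");
    while (std::getline(db, line)) last = line;
    if (last.empty()) {
        std::ifstream fs("job_info_backup.txt");
        while (std::getline(fs, line)) last = line;
    }
    if (last.empty()) return false;
    *last_record = last;
    return true;
}
```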
Fig. 5 is a schematic diagram illustrating a cascade fault-tolerant processing method adapted to CPU and GPU heterogeneous clusters according to an exemplary embodiment. As shown in fig. 5, this embodiment designs, as a whole, a cascade reconstruction fault-tolerant mechanism in which the data transmission consistency detection model, the data access consistency model, the data operation result correctness detection model, the application-layer service backup model and the system-layer service job information backup model are linked with one another. The data transmission consistency detection model, the data access consistency model and the data operation result correctness detection model guarantee correct data transmission among heterogeneous resources and data consistency within thread blocks, and form the basis of fault tolerance for the application-layer service backup model and the system-layer service job information backup model. Fault tolerance of the application-layer service backup model provides support for fault tolerance of the system-layer service job information backup model, and fault tolerance of the system-layer service job information backup model, through accurate detection of job states, provides a basis for fault tolerance of the application-layer service backup model, thereby safeguarding the full life cycle of the service.
It should be noted that the above-mentioned "first" to "fifth" steps are named only for convenience and distinction, and do not represent an order between the respective steps.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (8)

1. A cascade fault-tolerant processing method suitable for CPU and GPU heterogeneous clusters is characterized by comprising the following steps:
constructing a data transmission consistency detection model for detecting the consistency of data transmission;
constructing a data access consistency model for realizing the consistency of data access between the CPU and the GPU;
constructing a data operation result correctness detection model for detecting the correctness of the thread data operation result;
constructing a service backup model in an application layer for backing up the history of service operation;
a service operation information backup model is constructed at a system layer and is used for backing up operation information of service operation,
wherein the building of the service backup model at the application layer comprises:
periodically backing up service information, detecting the running state of a service at fixed time intervals in the main process of application, and backing up a history file of the service running;
aperiodic backup service information, in the branch flow of the application, checking the service operation state with the time required by each service as the interval, and backing up the history file of the service operation in the time interval,
the building of the service operation information backup model at the system layer comprises the following steps:
and simultaneously recording and redundantly backing up the operation information of the service operation by using the database and the file system.
2. The method of claim 1, wherein the constructing the data transmission consistency detection model comprises:
detecting the consistency of data information before transmission and data information after transmission;
detecting the consistency of the check values of the data before transmission and the data after transmission, wherein the check value of the data before transmission and the check value of the data after transmission are determined through the same hash function;
and determining that the data transmission has consistency under the condition that the information of the data before transmission is consistent with that of the data after transmission and the check values of the data before transmission and the data after transmission are consistent.
3. The method of claim 1, wherein building the data access consistency model comprises:
mapping a computing task into a plurality of concurrently executable threads, wherein the plurality of concurrently executable threads form respective thread blocks;
the consistency of thread data in the thread blocks is realized by combining a shared memory technology, a thread synchronization fence technology and an atomic operation technology.
4. The method of claim 1, wherein constructing the data operation result correctness detection model comprises:
distributing the same operation task to three operation modules with the same scale and different execution positions for respective operation, and respectively obtaining three operation results;
determining the operation result to be a correct operation result under the condition that the three operation results are all equal;
under the condition that exactly two of the three operation results are equal, determining the equal operation results to be the correct operation result;
and under the condition that the three operation results are not equal, re-executing the operation task.
5. The method of claim 1, wherein the recording and redundant backup of job information of the service run by the database and the file system simultaneously comprises: under the condition of abnormal operation caused by normal maintenance power failure, before shutdown maintenance of the system, the service operation state database information, the periodic backup service information and the aperiodic backup service information are manually reserved.
6. The method of claim 1, wherein the recording and redundant backup of job information of the service run by the database and the file system simultaneously comprises: and in the case of job abnormity caused by unexpected power failure, keeping periodic backup service information, aperiodic backup service information, job state database information and job state log of the time point of unexpected power failure.
7. The method of claim 1, wherein the recording and redundant backup of job information of the service run by the database and the file system simultaneously comprises: and under the condition that the data error causes operation abnormity, suspending service operation, reserving a history file of service operation before suspension, correcting the data with the error, and recovering the service after correction.
8. The method of claim 1, wherein the recording and redundant backup of job information of the service run by the database and the file system simultaneously comprises: and under the condition that the operation is abnormal due to misoperation of personnel, the misoperation is stopped, and the service is continuously operated from the backup time point of the current periodic backup service information of the system.
CN201710647664.0A 2017-08-01 2017-08-01 Cascade fault-tolerant processing method suitable for CPU and GPU heterogeneous clusters Active CN107506261B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710647664.0A CN107506261B (en) 2017-08-01 2017-08-01 Cascade fault-tolerant processing method suitable for CPU and GPU heterogeneous clusters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710647664.0A CN107506261B (en) 2017-08-01 2017-08-01 Cascade fault-tolerant processing method suitable for CPU and GPU heterogeneous clusters

Publications (2)

Publication Number Publication Date
CN107506261A CN107506261A (en) 2017-12-22
CN107506261B true CN107506261B (en) 2020-05-15

Family

ID=60689639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710647664.0A Active CN107506261B (en) 2017-08-01 2017-08-01 Cascade fault-tolerant processing method suitable for CPU and GPU heterogeneous clusters

Country Status (1)

Country Link
CN (1) CN107506261B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918233B (en) * 2019-03-06 2021-02-26 珠海金山网络游戏科技有限公司 Data processing method and device, computing equipment and storage medium
CN110083488B (en) * 2019-04-21 2023-03-17 哈尔滨工业大学 GPGPU-oriented fine-grained low-overhead fault-tolerant system
CN113296988A (en) * 2020-06-08 2021-08-24 阿里巴巴集团控股有限公司 Method and device for realizing fault isolation based on multi-container shared heterogeneous computing equipment
CN111930563B (en) * 2020-07-15 2022-01-11 中国人民解放军陆军工程大学 Fault tolerance method in cloud simulation system
CN113722136A (en) * 2021-08-11 2021-11-30 浪潮(山东)计算机科技有限公司 Server fault processing method and system, electronic equipment and storage medium
CN114489958A (en) * 2022-01-28 2022-05-13 全国海关信息中心 A method and system for ensuring data consistency of participants based on CTC model
CN114637624B (en) * 2022-05-19 2022-08-12 武汉凌久微电子有限公司 GPU (graphics processing unit) video memory access repairing method and device for active error detection
CN119415345B (en) * 2025-01-08 2025-03-28 北京汤谷软件技术有限公司 Collaborative functional testing method, device, and testing platform for heterogeneous system-on-chip

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101436151A (en) * 2008-12-01 2009-05-20 成都索贝数码科技股份有限公司 Data real time backup method and system based on file system
CN104090948A (en) * 2014-07-02 2014-10-08 中广核工程有限公司 Method, device and system for processing mass data of nuclear power station
CN104376062A (en) * 2014-11-11 2015-02-25 中国有色金属长沙勘察设计研究院有限公司 Heterogeneous database platform data synchronization method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10321208B2 (en) * 2015-10-26 2019-06-11 Alpinereplay, Inc. System and method for enhanced video image recognition using motion sensors
US20170177696A1 (en) * 2015-12-21 2017-06-22 Sap Se Usage of modeled validations on mobile devices in online and offline scenarios


Also Published As

Publication number Publication date
CN107506261A (en) 2017-12-22

Similar Documents

Publication Publication Date Title
CN107506261B (en) Cascade fault-tolerant processing method suitable for CPU and GPU heterogeneous clusters
US11556438B2 (en) Proactive cluster compute node migration at next checkpoint of cluster upon predicted node failure
US10949280B2 (en) Predicting failure reoccurrence in a high availability system
DE102013218341B4 (en) Substitute shifting of threads (thread sparing) between calculation cores in a multithreaded processor
US8132043B2 (en) Multistage system recovery framework
CN109885456A (en) A multi-type fault event prediction method and device based on system log clustering
CN110990197B (en) Optimization method of application-level multi-layer check point based on supercomputer
Yang et al. Reliable computing service in massive-scale systems through rapid low-cost failover
CN113626252A (en) City-level disaster recovery method and device based on cluster, electronic equipment and medium
Oussane et al. Fault tolerance in The IoT: a taxonomy based on techniques
CN105988885B (en) Operating system failure self-recovery method based on compensation rollback
Mesbahi et al. Cloud dependability analysis: Characterizing Google cluster infrastructure reliability
CN114827148B (en) Cloud security computing method and device based on cloud fault-tolerant technology and storage medium
Engelmann et al. Concepts for high availability in scientific high-end computing
JP2015106226A (en) Dual system
Persya et al. Fault tolerant real time systems
CN106777238B (en) An Adaptive Fault-Tolerant Adjustment Method for HDFS Distributed File System
CN112559253B (en) Method and device for backing up and restoring data of computer system
Raymond Proactive Fault Mitigation in HPC: Predictive Analytics for Parallel Computing Stability
George The Role of Redundancy in Enhancing Fault Tolerance in Parallel Computing Systems
Noor et al. A Review on Fault Tolerance in Distributed Database
George Reducing Downtime in HPC Clusters: Strategies for Fault Prediction and Prevention
Chamberlin How failures cascade in software systems
Terrosi et al. Failure modes and failure mitigation in GPGPUs: a reference model and its application
Archana A Survey of Fault Tolerance in Cloud Computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant