
CN119576563A - Method for creating and restoring a GPU application checkpoint, device, equipment, and storage medium - Google Patents


Info

Publication number: CN119576563A
Application number: CN202411701575.6A
Authority: CN (China)
Legal status: Pending
Inventors: Yin Hao (尹浩), Jiang Zhengxiong (江政雄), Gao Pengjun (高鹏军)
Assignee: China Telecom Cloud Technology Co Ltd
Other languages: Chinese (zh)


Abstract

The application relates to a checkpoint creation method, a restoration method, a creation apparatus, a device, and a storage medium for a GPU application program. The creation method comprises: setting GPU environment variables in a client based on the requirements of the GPU application program; configuring the GPU application program to call a pre-compiled dynamic library; executing the CPU-related parts of the GPU application program on the client's CPU; executing the GPU-related parts through remote procedure calls; recording first execution state information of the CPU in the client; recording second execution state information of the GPU in the server; and determining checkpoint information based on the first and second execution state information. This creation method reduces the time and space overhead of storing checkpoints and ensures that checkpoint creation does not significantly affect the overall performance of the GPU application program in complex computing tasks.

Description

GPU application program check point creation method, restoration method, creation device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular to a creation method, a restoration method, a creation apparatus, a device, and a storage medium for a checkpoint of a GPU application.
Background
With the development of GPU (Graphics Processing Unit) technology, GPUs have been widely used in high-performance computing (HPC), artificial intelligence (AI), machine learning (ML), deep learning (DL), scientific computing, and image processing. Compared to a traditional CPU (Central Processing Unit), a GPU has powerful parallel computing capabilities and can process a large number of data operations simultaneously, giving it significant performance advantages on graphics-intensive and computation-intensive tasks. However, as the complexity of GPU applications increases, long-running computing tasks may be affected during execution by unpredictable events such as hardware failures, software errors, and power outages, resulting in interrupted computation. This not only wastes a significant amount of computing resources, but may also cause data loss and process termination.
To guarantee the continuity and reliability of computing tasks, GPU applications need an efficient fault tolerance mechanism, and checkpoint and recovery techniques have become a key approach to this problem. Checkpointing and restoration are fault tolerance techniques widely used in distributed systems and high-performance computing. By periodically saving the current state of the program as a checkpoint file, the program can restart from the latest checkpoint when interrupted by a fault, avoiding rerunning the entire program from scratch. For CPUs, checkpoint restoration technology is relatively mature, but how to perform checkpoint restoration for GPU applications remains an open problem.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method for creating a checkpoint of a GPU application, a method for restoring the checkpoint, a creating apparatus, a device, and a storage medium, which are capable of creating and restoring the checkpoint of the GPU application.
In a first aspect, the present application provides a method for creating a checkpoint of a GPU application. The method comprises: setting GPU environment variables in a client based on the requirements of the GPU application; configuring the GPU application to call a pre-compiled dynamic library; executing the CPU-related parts of the GPU application on the client's CPU; executing the GPU-related parts through remote procedure calls, the remote procedure calls being executed jointly with a server connected to the client; recording first execution state information of the CPU in the client; recording second execution state information of the GPU in the server; and determining checkpoint information based on the first execution state information and the second execution state information.
According to this creation method, the GPU environment variables are set in the client according to the requirements of the GPU application, and the application is then configured to call the pre-compiled dynamic library so that the library can intercept and handle the relevant API calls. When the GPU application runs, the client's CPU executes the CPU-related parts while the GPU-related parts are executed through remote procedure calls; the first execution state information of the client's CPU and the second execution state information of the server's GPU are recorded and used as checkpoint information, completing the creation of the checkpoint. This method reduces the time and space overhead of storing checkpoints and ensures that checkpoint creation does not significantly affect the overall performance of the GPU application in complex computing tasks.
In one embodiment, the step of recording the second execution state information of the GPU in the server comprises: obtaining running information of the GPU application on the GPU, the running information including memory allocation, thread synchronization, and kernel execution state; obtaining internal information of the GPU through a debugger API, the internal information including synchronization state, program counters, and call stacks; obtaining memory information, including static memory information and dynamic memory information; and using the running information, the internal information, and the memory information as the second execution state information.
In one embodiment, the step of obtaining the memory information includes obtaining the static memory information based on static memory area information and obtaining the dynamic memory information based on dynamic memory area information.
In one embodiment, the step of obtaining the dynamic memory information based on the dynamic memory area information includes using API call information related to the dynamic memory as the dynamic memory area information.
In a second aspect, the application further provides a method for restoring a checkpoint of a GPU application, which comprises: loading checkpoint information, the checkpoint information having been created by the creation method according to the embodiments of the first aspect; and continuing execution of the GPU application based on a recovery point and the checkpoint information.
According to this restoration method, the stored checkpoint information is loaded and the GPU application continues execution from the recovery point based on that information, so that the application can be quickly restored and resumed when hardware faults, software errors, or other unpredictable events occur, significantly improving the reliability and stability of the system.
In one embodiment, the method further comprises determining the resume point by modifying a jump instruction in the GPU application.
In a third aspect, the application further provides an apparatus for creating a checkpoint of a GPU application. The apparatus comprises an environment setting module, a dynamic library updating module, a program running module, a remote calling module, a first information recording module, a second information recording module, and a checkpoint determining module. The environment setting module sets GPU environment variables in the client based on the requirements of the GPU application; the dynamic library updating module configures the GPU application to call the pre-compiled dynamic library; the program running module executes the CPU-related parts of the GPU application on the client's CPU; the remote calling module executes the GPU-related parts through remote procedure calls, the remote procedure calls being executed jointly with a server connected to the client; the first information recording module records the first execution state information of the CPU in the client; the second information recording module records the second execution state information of the GPU in the server; and the checkpoint determining module determines checkpoint information based on the first and second execution state information.
In one embodiment, the second information recording module is further configured to: obtain running information of the GPU application on the GPU, the running information including memory allocation, thread synchronization, and kernel execution state; obtain internal information of the GPU through a debugger API, the internal information including synchronization state, program counters, and call stacks; obtain memory information, including static memory information and dynamic memory information; and use the running information, the internal information, and the memory information as the second execution state information.
In a fourth aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the steps of the above method when the processor executes the computer program.
In a fifth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the above method.
Drawings
FIG. 1 is an application environment diagram of a creation method and a restoration method in one embodiment;
FIG. 2 is a flow diagram of a method of creation in one embodiment;
FIG. 3 is a flowchart illustrating recording second execution status information according to an embodiment;
FIG. 4 is a flow diagram of a recovery method in one embodiment;
FIG. 5 is a flow diagram of a creation method and a restoration method in one embodiment;
FIG. 6 is a schematic diagram of a module for creating an apparatus in one embodiment;
Fig. 7 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
First, several terms used in the present application are explained:
API (Application Programming Interface): a set of predefined functions intended to give applications and developers the ability to access routines of certain software or hardware without having to access the source code or understand the details of the internal working mechanisms.
RPC (Remote Procedure Call): a technique for requesting services from a program on a remote computer over a network, allowing a program to call a procedure or function on another computer as if it were a local function.
CUDA: a parallel computing platform and programming model developed by NVIDIA that enables developers to use NVIDIA GPUs for general-purpose computation; the CUDA API is its programming interface.
As described in the background, long-running GPU applications typically execute for anywhere from several hours to several days. The longer the execution time, the greater the likelihood that an error occurs during operation, so long-running GPU applications are inherently more susceptible to failures such as memory faults and hardware errors that force the application to be restarted. The checkpoint technique makes it possible to suspend execution without losing computational progress, saving the execution state of the application so that it can be resumed at a later point in time. However, the architecture of the GPU differs significantly from that of the CPU: the GPU's multithreaded parallel execution model and large-scale data processing capability place higher demands on checkpoint creation and restoration. Moreover, because of the complex data transfers between GPU memory and host memory, achieving efficient checkpoint storage without significantly affecting system performance is a difficult problem. Furthermore, GPU applications typically involve a large amount of computational state and many data dependencies, and accurately capturing and restoring these states to ensure the correctness and consistency of the program is also an important challenge for checkpoint restoration techniques.
Existing GPU checkpoint restoration schemes have several shortcomings. First, owing to the parallelism and large-scale data processing characteristics of the GPU architecture, checkpoint creation and storage may incur significant performance overhead. Second, when a fault occurs, the restoration process may fail to reproduce the computational state exactly, affecting the correctness and consistency of the program. Third, the storage strategy for checkpoint data may be inflexible and difficult to adapt to the requirements of different application scenarios.
Based on the above, the present application provides a method for creating a checkpoint of a GPU application, a method for restoring the checkpoint, a creating apparatus, a device and a storage medium, so as to solve the above problems.
The creation method and the recovery method provided by the embodiment of the application can be applied to an application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be located on a cloud or other network server. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, tablet computers, etc. The server 104 may be implemented as a stand-alone server or as a server cluster of multiple servers.
The terminal 102 in embodiments of the present application may be provided with GPU equipment or without GPU equipment. When the terminal 102 is provided with a GPU device, the client may be a client virtual machine, and the server may be a host virtual machine that may directly drive the GPU device. When the terminal 102 is not provided with the GPU device, the client is the terminal device, and the server is the remote device provided with the GPU device.
In one embodiment, as shown in fig. 2, a method for creating a checkpoint of a GPU application is provided, and the method is applied to the terminal 102 in fig. 1 for illustration, and includes the following steps:
In step S110, GPU environment variables are set in the client based on the requirements of the GPU application.
Specifically, in the client environment, the GPU environment variables related to running the GPU application are configured first; these include parameters such as the GPU video memory size, the GPU compute requirement, and the number of GPU cards required. Setting the GPU environment variables ensures that GPU resources are allocated and used reasonably while meeting the running requirements of the current GPU application. In a specific example, Linux GPU environment variables are set in the guest virtual machine, e.g., specifying 1 GPU card (eGPU=0), limiting the GPU video memory size to 500 MB (eGPUMEM=500), and limiting GPU utilization to 10% (eGPUWEIGHT=10).
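The environment-variable configuration above can be sketched as follows. This is an illustrative sketch only: the variable names eGPU, eGPUMEM, and eGPUWEIGHT come from the patent's example, but the defaults and exact semantics assumed here are not specified by the source.

```python
import os

def read_gpu_env(env=None):
    """Read the client's GPU resource limits from environment variables.

    Variable names follow the patent's example; defaults are assumptions.
    """
    if env is None:
        env = os.environ
    return {
        "gpu_index": int(env.get("eGPU", "0")),        # which GPU card to use
        "mem_mb": int(env.get("eGPUMEM", "500")),      # video memory cap (MB)
        "util_pct": int(env.get("eGPUWEIGHT", "10")),  # utilization cap (%)
    }

cfg = read_gpu_env({"eGPU": "0", "eGPUMEM": "500", "eGPUWEIGHT": "10"})
```

A real shim library would read these once at initialization and enforce the limits when forwarding GPU work.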
Step S120, the GPU application program is configured to call the compiled dynamic library.
Specifically, in the client's runtime environment for the GPU application, the dynamic link library related to the GPU API is replaced with a pre-compiled dynamic library, so that the GPU application calls the compiled library preferentially when it starts, enabling interception and hijacking of GPU API calls. In a specific example, the pre-compiled dynamic library is eCUDA-client.so, and the environment variable LD_PRELOAD=eCUDA-client.so is set to ensure the library is loaded when the GPU application starts.
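The interception idea can be illustrated with a minimal Python analogue (the real mechanism is an LD_PRELOAD shared library exporting the same C symbols as the GPU runtime; the function names and stand-in implementation below are assumptions for illustration):

```python
# The "preloaded" shim exports the same name as the real runtime call,
# records every invocation, and then forwards it to the real implementation.
call_log = []

def real_cuda_malloc(size):
    """Stand-in for the real runtime symbol the shim would forward to."""
    return {"ptr": 0x1000, "size": size}

def intercepted_cuda_malloc(size):
    """What the shim library would export in place of the real symbol."""
    call_log.append(("cudaMalloc", size))   # record for later checkpointing
    return real_cuda_malloc(size)           # forward to the real call

buf = intercepted_cuda_malloc(256)
```

The recorded `call_log` is exactly the kind of API-call trace that later embodiments use to reconstruct dynamic memory state.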
In step S130, the program running in relation to the CPU in the GPU application is executed by the CPU of the client.
Specifically, the GPU application comprises code that runs on the CPU and code that runs on the GPU. When the GPU application runs on the client, the CPU portion of the application runs normally on the local client; this code is mainly responsible for tasks such as logic control and data preprocessing.
Step S140, executing a program related to the operation of the GPU in the GPU application program through remote procedure call;
Specifically, the remote procedure calls are executed jointly with the server connected to the client: when the GPU application needs to execute GPU-related code, the client sends a request to the server on the remote GPU node through a remote procedure call (RPC); after receiving the request, the server executes the corresponding computing task using its GPU resources and returns the result to the client. In a specific example, during the initialization phase the dynamic library eCUDA-client.so reads the configured environment variables and polls to schedule a GPU node; GPU computing tasks are then sent via RPC to the remote GPU server for execution.
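The RPC round trip described above can be sketched in-process as follows (a real system would serialize over a network socket; the operation name and JSON wire format here are illustrative assumptions, not the patent's actual protocol):

```python
import json

def server_dispatch(request_bytes):
    """Server side: decode the request, run the task, return the result."""
    req = json.loads(request_bytes)
    handlers = {"add": lambda a: {"result": a["x"] + a["y"]}}  # toy "GPU" task
    return json.dumps(handlers[req["op"]](req["args"])).encode()

def client_call(op, **args):
    """Client side: serialize the call, send it, decode the reply."""
    request = json.dumps({"op": op, "args": args}).encode()
    response = server_dispatch(request)   # stands in for the network hop
    return json.loads(response)["result"]

answer = client_call("add", x=2, y=3)
```

In the patent's setting, `op` would identify an intercepted GPU API call and the server would execute it on real GPU hardware before replying.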
Step S150, recording the first execution state information of the CPU in the client.
Specifically, when a checkpoint is created, the execution state information of the CPU portion in the client is recorded and used as the first execution state information. The first execution state information includes register contents, the program counter (PC), the call stack, etc., so that the execution environment of the CPU portion can be reconstructed upon restoration. As a specific example, the first execution state information may be recorded using the Checkpoint/Restore In Userspace (CRIU) tool.
In step S160, the second execution state information of the GPU in the server is recorded.
Specifically, when the checkpoint is created, the execution state information of the GPU in the server needs to be recorded, and the execution state information is used as second execution state information. The second execution state information includes memory allocation status, thread synchronization status, kernel execution progress, etc., which are critical to ensure consistency of the GPU state after recovery.
Step S170, determining checkpoint information based on the first and second execution state information.
Specifically, the first execution state information of the client CPU part and the second execution state information of the server GPU part are integrated finally, and complete check point information is determined. Checkpoint information is used to resume execution from a checkpoint after a program interrupt. Checkpoint information may be saved to persistent storage (e.g., disk, network storage, etc.) for restoration when needed.
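The integration and persistence of the two state records in step S170 can be sketched as follows. This is a minimal illustration; the checkpoint layout and JSON serialization are assumptions, not the patent's actual on-disk format.

```python
import json
import os
import tempfile

def make_checkpoint(cpu_state, gpu_state):
    """Combine client CPU state and server GPU state into one checkpoint."""
    return {"cpu": cpu_state, "gpu": gpu_state, "version": 1}

def save_checkpoint(ckpt, path):
    """Persist the checkpoint to storage (disk here; could be network storage)."""
    with open(path, "w") as f:
        json.dump(ckpt, f)

def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)

# First execution state (CPU) and second execution state (GPU), both toy values.
ckpt = make_checkpoint({"pc": 42, "regs": [1, 2]}, {"allocs": [{"size": 256}]})
path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
save_checkpoint(ckpt, path)
```

The round trip through persistent storage is what allows restoration at an arbitrary later time, on the same or a different node.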
According to this creation method, the GPU environment variables are set in the client according to the requirements of the GPU application, and the application is then configured to call the pre-compiled dynamic library so that the library can intercept and handle the relevant API calls. When the GPU application runs, the client's CPU executes the CPU-related parts while the GPU-related parts are executed through remote procedure calls; the first execution state information of the client's CPU and the second execution state information of the server's GPU are recorded and used as checkpoint information, completing the creation of the checkpoint. This method reduces the time and space overhead of storing checkpoints and ensures that checkpoint creation does not significantly affect the overall performance of the GPU application in complex computing tasks.
In one embodiment, as shown in fig. 3, in step S160, the step of recording the second execution state information of the GPU in the server includes:
step S161, obtaining running information of the GPU application on the GPU.
Specifically, when the second execution state information of the GPU in the server is recorded, the running information of the GPU application on the GPU is obtained first. The running information comprises memory allocation, thread synchronization, and kernel execution state. The memory allocation records the sizes, locations, and ownership of all allocated memory blocks on the GPU, which helps reallocate the memory blocks and ensure data consistency upon restoration. The thread synchronization records the execution states and synchronization points of all threads on the GPU, including inter-thread dependencies, waiting conditions, semaphore states, etc., which is important for correctly ordering thread execution upon restoration. The kernel execution state records the state of the GPU kernels currently being executed, including the kernels' launch parameters, execution progress, and any pending operations, to help relaunch the kernels upon restoration and continue execution from the correct location. As a specific example, when a GPU application uses the CUDA API to execute code on the GPU, it can use APIs provided by the GPU driver (such as the CUDA Runtime API) to query memory allocations, monitor the execution of GPU threads, and use synchronization primitives (such as CUDA events and streams) to track thread states and synchronization points.
In step S162, the internal information of the GPU is acquired through the debugger API.
Specifically, to understand the state of the GPU more deeply, the debugger API is used to access the internal information of the GPU. The internal information comprises synchronization state, program counters, and call stacks. The synchronization state gives the current state of all synchronization primitives (e.g., semaphores, fences, and events) on the GPU, helping reestablish the correct synchronization relationships upon restoration. The program counter value of each GPU thread reflects the thread's current instruction location; upon restoration, the thread can be repositioned to the correct execution point based on this information. The call stack information of each GPU thread, including the function call sequence and parameters, helps reconstruct the thread's execution context upon restoration. For example, when a GPU application uses the CUDA API to execute code on the GPU, a GPU debugger API (e.g., NVIDIA Nsight Compute) may be used to connect to the GPU and query its internal state.
In step S163, the memory information is acquired.
Specifically, the memory information is an important component of the GPU state, and records the allocation and usage of all memory blocks on the GPU. The memory information comprises static memory information and dynamic memory information. Static memory information is used to record the size and location of statically allocated memory blocks (e.g., global variables and constant memory) on a GPU, which typically do not change during program execution. Dynamic memory information is used to record the size, location, and current content of dynamically allocated memory blocks (e.g., memory allocated by cudaMalloc) on a GPU, which may change as the program executes.
In step S164, the running information, the internal information and the memory information are used as the second execution status information.
Specifically, finally, all the information collected in the steps are integrated to form complete second execution state information. The second execution state information may be saved to persistent storage (e.g., disk, network storage, etc.) for retrieval when needed.
In one embodiment, the step of obtaining memory information in step S163 includes obtaining static memory information based on static memory region information and obtaining dynamic memory information based on dynamic memory region information.
Specifically, a static memory region is a memory region whose size and location are determined at program compile time, typically used to store global variables and constants. Static memory regions can be identified from the static memory region information by examining the compiled binary file or by using debugging tools. After determining the location and size of the static memory regions, the contents of these regions can be copied to the client's memory for recording using an API provided by the GPU (e.g., CUDA's cudaMemcpy function). A dynamic memory region is a memory region that is dynamically allocated and released as needed while the program runs; the size and location of these regions are unknown at compile time and are determined by the program at run time. In some embodiments, a particular memory address may be requested when allocating memory so as to determine the corresponding dynamic memory region information, and upon restoration the memory can be restored to the original address. In some embodiments, dynamic memory allocation may be tracked through the memory allocation functions provided by the GPU, recording each allocation's size, address, and associated metadata (e.g., allocation time, owning thread, etc.).
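The copy-out of static regions can be sketched as follows. A dictionary stands in for device memory, and the slice operation stands in for a device-to-host cudaMemcpy; the addresses and region list are hypothetical.

```python
# Simulated device memory: one static region at a known (compile-time) address.
device_memory = {0x2000: bytes(range(8))}

def record_static_regions(regions):
    """Snapshot each (address, size) static region with its contents.

    The slice below stands in for cudaMemcpy(dst, addr, size, DeviceToHost).
    """
    snapshot = []
    for addr, size in regions:
        data = device_memory[addr][:size]
        snapshot.append({"addr": addr, "size": size, "data": data})
    return snapshot

snap = record_static_regions([(0x2000, 8)])
```

On restoration, the same region list drives the reverse copy (host back to device), which is why the address and size metadata are stored alongside the contents.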
In one embodiment, the step of obtaining the dynamic memory information based on the dynamic memory region information comprises using API call information associated with the dynamic memory as the dynamic memory region information. Specifically, in this embodiment, the API calls related to the dynamic memory are recorded and used as the dynamic memory region information. When the checkpoint is restored, the API calls are executed again in the recorded order with the recorded parameters; since the parameters of the calls (such as allocation sizes) match those of the original checkpoint, the dynamic memory is highly likely to be allocated at the same or similar locations as when the checkpoint was created.
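The replay idea can be sketched as follows: a deterministic bump allocator stands in for the GPU's allocator, so replaying the recorded calls in the original order with the original sizes yields the same addresses. The allocator and starting address are assumptions for illustration.

```python
def make_allocator():
    """A deterministic bump allocator standing in for the GPU allocator."""
    state = {"next": 0x1000}
    def alloc(size):
        addr = state["next"]
        state["next"] += size
        return addr
    return alloc

def replay(log, alloc):
    """Re-execute recorded allocation calls in order, with recorded sizes."""
    return [alloc(size) for (api, size) in log if api == "cudaMalloc"]

log = [("cudaMalloc", 256), ("cudaMalloc", 64)]      # recorded at checkpoint time
original = replay(log, make_allocator())             # addresses when created
restored = replay(log, make_allocator())             # same order, same sizes
```

Because real GPU allocators are not guaranteed to be this deterministic, the patent hedges accordingly ("highly likely to be allocated at the same or similar location").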
In one embodiment, as shown in fig. 4, the present application further provides a method for recovering a checkpoint of a GPU application, including the following steps:
step S210, loading checkpoint information.
Specifically, during the running of the GPU application, checkpoints may be created at multiple nodes, yielding multiple sets of checkpoint information, each created by the method for creating a checkpoint of a GPU application in the above embodiments. When the GPU application needs to be restored from the latest checkpoint, the previously stored checkpoint information is loaded first, i.e., the corresponding data is read from the storage medium and deserialized into a format the program can recognize.
Step S220, the GPU application is continuously executed based on the recovery point and the checkpoint information.
Specifically, after the checkpoint information is loaded, the GPU application may continue to execute from the recovery point based on the checkpoint information. Checkpoint restoration includes restoring the execution state of the CPU code and the execution state on the GPU. In a specific example, the execution state of the CPU code in the GPU application is restored using the Checkpoint/Restore In Userspace (CRIU) tool, and the execution state information of the GPU application on the GPU is accessed and restored using the debugger API or other tools. The restored information is the same as the information recorded in the above embodiments and is not described again here.
According to the method for recovering the GPU application program checkpoints, the stored checkpoint information is loaded, and the GPU application program is continuously executed according to the recovery point and the checkpoint information, so that the GPU application program can be quickly recovered and continuously executed when hardware faults, software errors or other unpredictable events are encountered, and the reliability and the stability of a system are remarkably improved.
In one embodiment, the method for restoring a GPU application checkpoint further comprises determining the recovery point by modifying a jump instruction in the GPU application. Specifically, the recovery point is the specific location from which the GPU application continues to execute after an interruption; by modifying a jump instruction in the GPU application, the recovery point can be set flexibly so that execution resumes from that point when needed. In some embodiments, critical values such as the program counter (PC) are modified to ensure that the GPU application begins execution from the recovery point rather than from its initial position.
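The recovery flow above — loading saved state and jumping to a recovery point instead of the program entry — can be sketched as follows. A list of steps stands in for the program, and the saved index stands in for the patched jump / program counter; all of this is an illustrative analogue, not the patent's binary-patching mechanism.

```python
# A toy "program" as a sequence of steps operating on a single state value.
steps = [lambda s: s + 1, lambda s: s * 10, lambda s: s - 3]

def run(state, pc=0):
    """Execute the program from step index `pc` onward.

    `pc` plays the role of the recovery point set by the patched jump.
    """
    for step in steps[pc:]:
        state = step(state)
    return state

full = run(0)                     # uninterrupted run from the entry point
saved_state, saved_pc = 1, 1      # checkpoint taken after step 0 completed
resumed = run(saved_state, saved_pc)
```

Resuming from the checkpointed state at the saved step index reproduces the same final result as an uninterrupted run, which is exactly the consistency property checkpoint restoration must guarantee.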
The method for creating and restoring a GPU application checkpoint according to the present application is described in detail below through a specific embodiment. As shown in fig. 5, the GPU application uses the CUDA API to execute code on the GPU. When the GPU application runs, GPU environment variables are set in the client according to the GPU resources that the GPU-side portion of the application needs to occupy, completing the parameter setting for the GPU resources occupied by the program; the corresponding code is executed by the remote GPU in the server, and the result of the GPU API is returned through the remote procedure call. Meanwhile, the programs in the GPU application that involve CPU operation are run by the local CPU. Once the result returned by the remote procedure call has been combined with the result computed by the local CPU, the run of the GPU application is complete. When a checkpoint is created, a CUDA-GDB session is attached to the client CPU so that the client CPU can debug the GPU application. The client CPU obtains the CPU debugger API to pause the functions (kernels) executing in parallel on the GPU, and saves both the runtime execution state of the CPU and the GPU (the specific state at the current moment, such as the program counter, register contents, and memory allocation) and the persistent execution state (state information preserved over the whole execution period, such as the initialization configuration, loaded library files, and key data). Finally, the CUDA-GDB session is detached from the client CPU.
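The creation sequence above (attach a debug session, pause the kernels, capture state, detach) can be sketched as follows. The session object and state contents are assumptions for illustration; a real implementation would drive CUDA-GDB and the CPU-side debugger rather than this stand-in class.

```python
# Illustrative sketch of the checkpoint-creation sequence:
# attach -> pause kernels -> capture CPU and GPU state -> detach.

class DebugSession:
    """Stand-in for a CUDA-GDB session attached to the client CPU."""
    def __init__(self):
        self.attached = False
        self.kernels_paused = False

    def attach(self):
        self.attached = True

    def pause_kernels(self):
        # Stop the functions (kernels) executing in parallel on the GPU.
        self.kernels_paused = True

    def detach(self):
        self.attached = False


def create_checkpoint(session, cpu_state, gpu_state):
    session.attach()                    # add a debug session to the client CPU
    session.pause_kernels()
    checkpoint = {
        # runtime + persistent state captured while kernels are paused
        "first_execution_state": cpu_state,
        "second_execution_state": gpu_state,
    }
    session.detach()                    # separate the session once state is saved
    return checkpoint

session = DebugSession()
ckpt = create_checkpoint(session, {"pc": 0x4005D0}, {"kernels": ["k0"]})
```

Note the invariant the sketch encodes: state is captured only while the kernels are paused, and the session is detached before the checkpoint is returned.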
When the checkpoint is restored, the jump instruction in the binary file of the GPU application is first patched, and the corresponding checkpoint information is then applied. At the same time, the CPU debugger API is obtained, and the code used for restoration and repair is checked and prepared. The runtime execution state and the persistent execution state of the CPU and the GPU are then restored, which completes the checkpoint restoration, after which the corresponding restoration and repair code is detached.
The creation method and the restoration method of the present application have the following advantages:
By improving the checkpoint creation method, the efficiency of checkpoint creation is increased, which reduces the time and space overhead of storing checkpoints and ensures that checkpoint creation does not significantly affect the overall performance of the GPU application in complex computing tasks.
The application provides a recovery mechanism capable of accurately capturing the GPU computing state, ensuring that the GPU application can continue executing in the shortest possible time after restoring from a checkpoint, without affecting the correctness and consistency of the computing results.
The checkpoint creation method can efficiently manage checkpoint data according to different application scenarios and requirements, so as to adapt to the needs of GPU applications in a variety of environments.
The fault tolerance of the system is enhanced: with the checkpoint restoration method, the GPU application can be quickly restored and resume execution when facing hardware faults, software errors, or other unpredictable events, significantly improving the reliability and stability of the system.
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated. Unless explicitly stated herein, the steps are not strictly limited to that order and may be performed in other orders. Moreover, at least some of the steps in these flowcharts may comprise multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turns or alternately with at least some of the other steps or sub-steps.
Based on the same inventive concept, the embodiment of the application also provides a device for creating and recovering the GPU application program check point for realizing the method for creating and recovering the GPU application program check point. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitations in the embodiments of the creation device and the recovery device of the GPU application checkpoint provided below may be referred to the above limitations of the creation method and the recovery method, and are not repeated herein.
In one embodiment, as shown in FIG. 6, there is provided a GPU application checkpoint creation apparatus, comprising an environment setting module 310, a dynamic library update module 320, a program running module 330, a remote call module 340, a first information recording module 350, a second information recording module 360, and a checkpoint determination module 370, wherein:
an environment setting module 310, configured to set GPU environment variables in the client based on requirements of the GPU application;
a dynamic library updating module 320, configured to configure the GPU application to invoke the compiled dynamic library;
a program running module 330, configured to execute, by a CPU of the client, a program related to the CPU running in the GPU application program;
a remote call module 340, configured to execute a program related to GPU operation in the GPU application program through a remote procedure call, where the remote procedure call is executed jointly by a server connected to the client;
a first information recording module 350, configured to record first execution state information of the CPU in the client;
a second information recording module 360, configured to record second execution state information of the GPU in the server; and
a checkpoint determination module 370, configured to determine checkpoint information based on the first execution state information and the second execution state information.
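The remote call module above forwards GPU-related calls to the server and returns the results to the client. A minimal sketch of that forwarding pattern is shown below; the API table, serialization, and transport are illustrative assumptions, not a real CUDA RPC protocol (a real system would carry the call across the network to the server's GPU).

```python
# Minimal model of the remote-procedure-call pattern: the client
# serializes a GPU API call, the server executes it and returns the
# result. All names here are hypothetical stand-ins.
import json

SERVER_API = {
    # stand-in for a GPU-side operation executed on the server
    "vector_add": lambda a, b: [x + y for x, y in zip(a, b)],
}

def server_handle(request_bytes):
    """Server side: decode the request, run the GPU operation, encode the result."""
    request = json.loads(request_bytes)
    result = SERVER_API[request["fn"]](*request["args"])
    return json.dumps({"result": result})

def remote_call(fn, *args):
    """Client side: in a real deployment this would cross the network."""
    request = json.dumps({"fn": fn, "args": list(args)})
    return json.loads(server_handle(request))["result"]

result = remote_call("vector_add", [1, 2], [3, 4])
```

The client-side result is then combined with whatever the local CPU computed, matching the flow described in the embodiment above.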
In one embodiment, the second information recording module 360 is further configured to: obtain running information of the GPU application on the GPU, the running information including memory allocation, thread synchronization, and kernel execution status; obtain internal information of the GPU through the debugger API, the internal information including synchronization status, the program counter, and the call stack; obtain memory information, the memory information including static memory information and dynamic memory information; and use the running information, the internal information, and the memory information as the second execution state information.
In one embodiment, the second information recording module 360 is further configured to obtain static memory information based on the static memory area information, and obtain dynamic memory information based on the dynamic memory area information.
In one embodiment, the second information recording module 360 is further configured to use the API call information related to the dynamic memory as the dynamic memory area information.
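The idea of using dynamic-memory-related API calls as the dynamic memory area information can be sketched as follows: each allocation call is intercepted and its (pointer, size) pair recorded at call time, so the recorded regions can later be copied out when the checkpoint is created. The allocator here is a hypothetical stand-in for a real device allocator such as cudaMalloc, and the pointer values are fake.

```python
# Sketch: treating dynamic-memory API calls as the dynamic memory
# area information. Every allocation is recorded as a (pointer, size)
# pair; the record itself *is* the dynamic memory area information.

dynamic_regions = []       # (pointer, size) pairs recorded at call time
_next_ptr = [0x10000]      # fake device-pointer counter for illustration

def tracked_malloc(size):
    """Hypothetical interposed allocator: record the call, then allocate."""
    ptr = _next_ptr[0]
    _next_ptr[0] += size
    dynamic_regions.append((ptr, size))   # the API call info is the region info
    return ptr

p1 = tracked_malloc(256)
p2 = tracked_malloc(1024)
```

In a real system this interception would typically wrap the device allocation API (e.g. via library interposition) rather than a Python function, but the bookkeeping is the same: the call log enumerates exactly the dynamic regions that must be saved.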
In one embodiment, a recovery device for a GPU application program check point is provided, which comprises an information loading module and a program execution module, wherein:
the information loading module is used for loading checkpoint information, wherein the checkpoint information is created by the method for creating the checkpoint of the GPU application program in the embodiment;
and the program execution module is used for continuously executing the GPU application program based on the recovery point and the check point information.
In one embodiment, the restoration device of the GPU application checkpoints further comprises a restoration point setting module for determining restoration points by modifying jump instructions in the GPU application.
The modules in the above creation device and restoration device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in the form of hardware, or stored in a memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal whose internal structure may be as shown in fig. 7. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface, the display unit, and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless mode may be implemented through WIFI, a mobile cellular network, NFC (near field communication), or other technologies. The computer program is executed by the processor to implement the creation method and the restoration method. The display unit of the computer device is used to form a visual picture and may be a display screen, a projection device, or a virtual reality imaging device. The display screen may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a key, a trackball, or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, or mouse.
It will be appreciated by those skilled in the art that the structure shown in FIG. 7 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the method embodiments described above.
Those skilled in the art will appreciate that all or part of the above methods may be implemented by a computer program stored on a non-volatile computer-readable storage medium, which, when executed, may include the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. The volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing embodiments represent only a few implementations of the application, and although they are described in detail, they should not be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and improvements without departing from the concept of the application, and these all fall within the protection scope of the application. Accordingly, the protection scope of the application shall be subject to the appended claims.

Claims (10)

1. A method for creating a GPU application checkpoint, characterized in that the method comprises:
setting GPU environment variables in a client based on requirements of the GPU application;
configuring the GPU application to call a compiled dynamic library;
executing, by a CPU of the client, programs in the GPU application that involve CPU operation;
executing, through a remote procedure call, programs in the GPU application that involve GPU operation; wherein the remote procedure call is executed jointly by a server connected to the client;
recording first execution state information of the CPU in the client;
recording second execution state information of the GPU in the server;
determining checkpoint information based on the first execution state information and the second execution state information.
2. The method for creating a GPU application checkpoint according to claim 1, characterized in that the step of recording the second execution state information of the GPU in the server comprises:
obtaining running information of the GPU application on the GPU; wherein the running information comprises: memory allocation, thread synchronization, and kernel execution status;
obtaining internal information of the GPU through a debugger API; wherein the internal information comprises: synchronization status, a program counter, and a call stack;
obtaining memory information; wherein the memory information comprises: static memory information and dynamic memory information;
using the running information, the internal information, and the memory information as the second execution state information.
3. The method for creating a GPU application checkpoint according to claim 2, characterized in that the step of obtaining memory information comprises:
obtaining the static memory information based on static memory area information;
obtaining the dynamic memory information based on dynamic memory area information.
4. The method according to claim 3, characterized in that the step of obtaining the dynamic memory information based on the dynamic memory area information comprises:
using API call information related to dynamic memory as the dynamic memory area information.
5. A method for restoring a GPU application checkpoint, characterized in that the method comprises:
loading checkpoint information; wherein the checkpoint information is created by the method for creating a GPU application checkpoint according to any one of claims 1 to 4;
continuing to execute the GPU application based on a recovery point and the checkpoint information.
6. The method for restoring a GPU application checkpoint according to claim 5, characterized in that the method further comprises:
determining the recovery point by modifying a jump instruction in the GPU application.
7. An apparatus for creating a GPU application checkpoint, characterized in that the apparatus comprises:
an environment setting module, configured to set GPU environment variables in a client based on requirements of the GPU application;
a dynamic library update module, configured to configure the GPU application to call a compiled dynamic library;
a program running module, configured to execute, by a CPU of the client, programs in the GPU application that involve CPU operation;
a remote call module, configured to execute, through a remote procedure call, programs in the GPU application that involve GPU operation; wherein the remote procedure call is executed jointly by a server connected to the client;
a first information recording module, configured to record first execution state information of the CPU in the client;
a second information recording module, configured to record second execution state information of the GPU in the server;
a checkpoint determination module, configured to determine checkpoint information based on the first execution state information and the second execution state information.
8. The apparatus for creating a GPU application checkpoint according to claim 7, characterized in that the second information recording module is further configured to: obtain running information of the GPU application on the GPU, wherein the running information comprises memory allocation, thread synchronization, and kernel execution status; obtain internal information of the GPU through a debugger API, wherein the internal information comprises synchronization status, a program counter, and a call stack; obtain memory information, wherein the memory information comprises static memory information and dynamic memory information; and use the running information, the internal information, and the memory information as the second execution state information.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
CN202411701575.6A 2024-11-26 2024-11-26 Method for creating and restoring a GPU application checkpoint, device, equipment, and storage medium Pending CN119576563A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411701575.6A CN119576563A (en) 2024-11-26 2024-11-26 Method for creating and restoring a GPU application checkpoint, device, equipment, and storage medium


Publications (1)

Publication Number Publication Date
CN119576563A 2025-03-07



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination