CN113760469A - Distributed computing method and device - Google Patents
- Publication number
- CN113760469A (application CN202110144149.7A)
- Authority
- CN
- China
- Prior art keywords
- sub
- node
- computation
- computed
- state data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/465—Distributed object oriented systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5072—Grid computing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
- G06F9/5088—Techniques for rebalancing the load in a distributed system involving task migration
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Hardware Redundancy (AREA)
Abstract
The invention discloses a distributed computing method and device, and relates to the field of computer technology. One embodiment of the method comprises: acquiring computing task data to be computed; splitting the computing task data into multiple sub-computing task data, distributing the sub-computing task data to multiple nodes, and computing the multiple sub-computing tasks through those nodes; storing the state data of the computing task in stages, and marking the state data of the sub-computing tasks computed by each node in the stored state data with corresponding node information; and, when a node fails, extracting the last-stage state data of the sub-computing tasks computed by the failed node according to the marked node information, and recovering those sub-computing tasks in a normal node from that last-stage state data. Compared with the prior art, this embodiment reduces the time and amount of computation required for failure recovery.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a distributed computing method and apparatus.
Background
In recent years, with the development of technology, distributed computing has been adopted to meet high-performance computing scenarios. Compared with traditional centralized computing, distributed computing divides a computing task into multiple sub-computing tasks, uses multiple machines to compute those sub-computing tasks, and combines the results obtained when the sub-computing tasks finish into the final result.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
in such distributed computing methods, when a problem occurs in any one of the sub-computing tasks during computation (for example, the machine hosting that sub-computing task goes down), the whole computing task is restarted. That is, when any sub-computation task has a problem, all sub-computations restart from a certain historical state.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for distributed computing, which can reduce the time and the amount of computation required for fault recovery.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a distributed computing method including:
acquiring computing task data to be computed;
splitting the computing task data into multiple sub-computing task data, distributing the multiple sub-computing task data to multiple nodes, and computing multiple sub-computing tasks through the multiple nodes;
storing state data of the computing tasks in stages, and marking the state data of the sub-computing tasks computed by each node in the stored state data with corresponding node information respectively; and
when a failed node occurs, extracting state data of a previous stage of a sub-computation task computed by the failed node according to the node information marked on the state data, and recovering the sub-computation task computed by the failed node in a normal node according to the state data of the previous stage of the sub-computation task computed by the failed node.
Optionally, in the above-described distributed computing method:
the step of, when a failed node occurs, extracting state data of a previous stage of a sub-computation task computed by the failed node based on the node information marked for the state data, and restoring, in a normal node, the sub-computation task computed by the failed node based on the state data of the previous stage of the sub-computation task computed by the failed node, includes:
when a fault node occurs, extracting state data of a previous stage of a sub-computation task computed by the fault node according to the node information marked by the state data; and
suspending the sub-computation tasks computed by the other nodes until the sub-computation tasks computed by the failed node have been recovered in the normal node from the state data of the previous stage of the sub-computation tasks computed by the failed node.
Optionally, in the above-described distributed computing method:
the step of suspending the sub-computation tasks computed by the other nodes until the sub-computation task computed by the failed node has been recovered in the normal node from the state data of the last stage of the sub-computation task computed by the failed node comprises:
suspending the sub-computation tasks computed by other nodes, and respectively storing all data generated by the computation of each node in the current node; and
when the sub-computation tasks computed by the failed node have been restored in the normal node, the sub-computation tasks computed by the other nodes are started, and data resulting from the computation is distributed among the nodes.
Optionally, in the above-described distributed computing method:
the step of storing the state data of the computation task in stages and marking the state data of the sub-computation tasks computed by each node in the stored state data with corresponding node information respectively comprises the following steps:
and storing the state data of the computing task in stages, and marking the state data of the sub-computing task computed by each node in the stored state data with sub-computing task information respectively.
Optionally, in the above-described distributed computing method:
the step of, when a failed node occurs, extracting state data of a previous stage of a sub-computation task computed by the failed node based on the node information marked for the state data, and restoring, in a normal node, the sub-computation task computed by the failed node based on the state data of the previous stage of the sub-computation task computed by the failed node, includes:
when a fault node occurs, judging whether a sub-computation task related to the sub-computation task influenced by the fault node exists in other nodes or not according to node information and sub-computation task information respectively marked on state data of the sub-computation task computed by the fault node, and screening out nodes with the related sub-computation tasks; and
suspending the sub-computation tasks in the failed node and the node in which the related sub-computation task exists until the sub-computation task computed by the failed node has been recovered in the normal node according to the state data of the last stage of the sub-computation task computed by the failed node.
Optionally, in the above-described distributed computing method:
the step of suspending the sub-computation tasks in the failed node and the node in which the related sub-computation task exists until the sub-computation task computed by the failed node has been recovered in the normal node from the state data of the last stage of the sub-computation task computed by the failed node includes:
suspending the sub-computing tasks in the fault node and the node with the related sub-computing tasks, and respectively storing data generated by computing each node in the current node; and
when the sub-computation task computed by the failed node has been restored in the normal node based on the state data of the last stage of the sub-computation task computed by the failed node, the sub-computation task computed by the failed node and the sub-computation task computed by the node where the related sub-computation task exists are started, and data resulting from the computation is distributed among the nodes.
Optionally, in the above-described distributed computing method:
the node periodically sends state data to the management node, and when the management node does not receive the state data of the node, the node is judged to have a fault.
According to another aspect of the embodiments of the present invention, there is provided an apparatus for distributed computing, including:
the task acquisition module acquires computing task data to be computed;
the task allocation module divides the computing task data acquired by the task acquisition module into a plurality of sub-computing task data, allocates the plurality of sub-computing task data to a plurality of nodes, and calculates a plurality of sub-computing tasks through the plurality of nodes;
the state storage module stores the states of the computing tasks in stages and marks corresponding node information on the state data of the sub-computing tasks computed by each node in the stored state data respectively; and
a fault recovery module that extracts, when a faulty node occurs, state data of a previous stage of a sub-computation task calculated by the faulty node according to the node information marked for the state data, and recovers, in a normal node, the sub-computation task calculated by the faulty node according to the state data of the previous stage of the sub-computation task calculated by the faulty node.
Optionally, in the above-described distributed computing apparatus:
when a fault node occurs, the fault recovery module extracts state data of the last stage of the sub-computing task computed by the fault node according to the node information marked by the state data; and
suspending the sub-computation tasks computed by the other nodes until the sub-computation tasks computed by the failed node have been recovered in the normal node from the state data of the previous stage of the sub-computation tasks computed by the failed node.
Optionally, in the above-described distributed computing apparatus:
the fault recovery module suspends the sub-computing tasks computed by other nodes and stores all data generated by computing of each node in the current node respectively; and
when the sub-computation tasks computed by the failed node have been restored in the normal node, the sub-computation tasks computed by the other nodes are started, and data resulting from the computation is distributed among the nodes.
Optionally, in the above-described distributed computing apparatus:
the state storage module stores the state data of the computing task in stages, and marks the state data of the sub-computing task computed by each node in the stored state data with sub-computing task information respectively.
Optionally, in the above-described distributed computing apparatus:
when a fault node occurs, the fault recovery module judges whether a sub-computation task related to the sub-computation task affected by the fault node exists in other nodes according to node information and sub-computation task information respectively marked on state data of the sub-computation task computed by the fault node, and screens out nodes with the related sub-computation tasks; and
suspending the sub-computation tasks in the failed node and the node in which the related sub-computation task exists until the sub-computation task computed by the failed node has been recovered in the normal node according to the state data of the last stage of the sub-computation task computed by the failed node.
Optionally, in the above-described distributed computing apparatus:
the fault recovery module suspends the sub-computing tasks in the fault node and the node with the related sub-computing tasks, and respectively stores data generated by computing each node in the current node; and
when the sub-computation task computed by the failed node has been restored in the normal node based on the state data of the last stage of the sub-computation task computed by the failed node, the sub-computation task computed by the failed node and the sub-computation task computed by the node where the related sub-computation task exists are started, and data resulting from the computation is distributed among the nodes.
Optionally, in the above-described distributed computing apparatus:
the node periodically sends state data to the management node, and when the management node does not receive the state data of the node, the node is judged to have a fault.
There is also provided, in accordance with an embodiment of the present invention, an electronic device for distributed computing, including:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement a method as in any one of the above.
There is also provided, in accordance with an embodiment of the present invention, a computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method as set forth in any one of the above.
One embodiment of the above invention has the following advantage: because the state data of the sub-computing tasks computed by each node in the stored state data is marked with corresponding node information, when a node fails, only the sub-computing tasks computed by the failed node need to be recomputed from the state data of the previous stage. This overcomes the prior-art problem that the whole computing task must be recovered from a previous state when a single node fails, and achieves the technical effect of reducing the time and amount of computation required for failure recovery.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of a main flow of a method of distributed computing according to an embodiment of the invention;
FIG. 2 is a schematic diagram of the main modules of an apparatus for distributed computing according to an embodiment of the present invention;
FIG. 3 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 4 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of a main flow of a distributed computing method according to an embodiment of the present invention, and as shown in fig. 1, the distributed computing method according to the embodiment of the present invention includes:
step S101, computing task data to be computed is obtained.
In the embodiment of the present invention, the computing task data to be computed is, for example, data of a computing task that has a large data volume and requires strong computing power. Of course, a computing task with an ordinary data volume can also be computed by the distributed computing method of the embodiment of the present invention, thereby shortening the time required for the computation.
Step S102, splitting the calculation task data obtained in step S101 into a plurality of sub-calculation task data, distributing the plurality of sub-calculation task data to a plurality of nodes, and calculating the plurality of sub-calculation tasks by the plurality of nodes.
In the embodiment of the invention, a computing task with a large data volume that requires strong computing power is divided into multiple sub-computing tasks; the data of the divided sub-computing tasks are distributed to multiple nodes, and the sub-computing tasks are computed by those nodes, which reduces the computing-power requirement on each individual node. In the embodiment of the present invention, a node is, for example, a computer, and one or more sub-computation tasks may be allocated to one node.
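As a concrete illustration of this splitting and distribution step, the following Python sketch partitions task data into sub-task chunks and assigns them to nodes round-robin, so that one node may receive more than one sub-task. All function and node names here are hypothetical illustrations, not taken from the patent.

```python
# Hypothetical sketch of step S102: split computing task data into
# sub-computing task data and distribute it to nodes.

def split_task(task_data, num_subtasks):
    """Split the task data into roughly equal sub-computation chunks."""
    chunk = (len(task_data) + num_subtasks - 1) // num_subtasks
    return [task_data[i:i + chunk] for i in range(0, len(task_data), chunk)]

def assign_to_nodes(subtasks, node_ids):
    """Round-robin assignment; one node may receive several sub-tasks."""
    assignment = {nid: [] for nid in node_ids}
    for i, sub in enumerate(subtasks):
        assignment[node_ids[i % len(node_ids)]].append(sub)
    return assignment

subtasks = split_task(list(range(10)), 4)
assignment = assign_to_nodes(subtasks, ["node-a", "node-b"])
# node-a holds sub-tasks 0 and 2; node-b holds sub-tasks 1 and 3
```

In a real system the split would follow data locality and load, but the shape of the mapping from nodes to assigned sub-tasks is the same.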
Step S103 stores the state data of the computation task in stages, and marks the state data of the sub-computation task computed by each node in the stored state data with corresponding node information, respectively.
In the embodiment of the present invention, specifically, for example, barrier (barrier) information is inserted into the sub-computation task data acquired by the node, and when the sub-computation task computes the barrier information, the state data of the sub-computation task is stored, for example, in the management node.
The barrier information carries, for example, the same sequence number for all the sub-computation tasks; when every sub-computation task has computed up to the barrier information with the same sequence number, and the state data of all sub-computation tasks has accordingly been stored in the management node, the state data of one stage of the computation task has been stored. The state data refers to the computed data, configuration information, metadata, and the like of all current computing nodes, and it is stored externally, for example in a management node. In this specification, the state data of a sub-computation task means the state data of one sub-computation.
As each node continues computing its sub-computation tasks, once all sub-computation tasks have reached the barrier information with the next sequence number, the state data of all sub-computation tasks for that next stage is stored in the management node. When the state data of all sub-computing tasks for the next stage has been stored, the state data stored for the previous stage is deleted. This step may also be understood as updating the state data as the computation of the computing task proceeds.
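The staged snapshot bookkeeping described above can be sketched as follows. `CheckpointStore`, its method names, and the shape of the state records are illustrative assumptions: a stage is complete when every sub-task has reported state for the same barrier sequence number, at which point the previously completed stage is discarded.

```python
# Illustrative checkpoint store kept on the management node (names assumed).

class CheckpointStore:
    def __init__(self, subtask_ids):
        self.subtask_ids = set(subtask_ids)
        self.pending = {}          # barrier seq -> {subtask_id: (node, state)}
        self.completed_stage = None

    def report(self, barrier_seq, subtask_id, node_id, state):
        stage = self.pending.setdefault(barrier_seq, {})
        # the sub-task's state data is tagged with the node that computed it
        stage[subtask_id] = (node_id, state)
        if set(stage) == self.subtask_ids:
            # stage complete: keep it, dropping the previously kept stage
            self.completed_stage = (barrier_seq, self.pending.pop(barrier_seq))

    def state_for_node(self, node_id):
        """Extract last-stage state of the sub-tasks a given node computed."""
        _seq, stage = self.completed_stage
        return {sid: st for sid, (nid, st) in stage.items() if nid == node_id}
```

Keeping only the most recently completed stage mirrors the deletion of the previous stage's state data described in the text.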
According to the distributed computing method provided by the embodiment of the invention, the state data of the sub-computing tasks computed by each node in the stored state data are respectively marked with corresponding node information. That is, the state of the sub-computation task in the stored state data is marked as to which node it was computed by, whereby the state data of the previous stage of the sub-computation task computed by the node can be called for recovery when the node has a problem.
And step S104, when a fault node occurs, extracting the state data of the previous stage of the sub-computation task computed by the fault node according to the node information marked on the state data, and recovering the sub-computation task computed by the fault node in the normal node according to the state data of the previous stage of the sub-computation task computed by the fault node.
For example, a node (or nodes) may fail due to some problem (machine downtime, power outage, etc.) and may not be able to continue computing. At this time, since the state data of the sub-computation task calculated by each node in the stored state data is respectively marked with the corresponding node information when the state data of the computation task is stored in step S103, it is possible to extract the state data of the previous stage of the sub-computation task calculated by the failed node from the stored state data of the computation task, and to recover the sub-computation task in the failed node from the state data of the previous stage of the sub-computation task calculated by the failed node in the normal node. The management node, for example, newly creates a node to recover the sub-computation task in the failed node based on the state data of the previous stage of the sub-computation task computed by the failed node.
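A minimal sketch of the extraction step, assuming the stored stage is a mapping from sub-task id to a `(node, state)` pair: only entries tagged with the failed node's id are selected for recovery, while sub-tasks on healthy nodes continue untouched. All names and the state fields below are hypothetical.

```python
# Extract the last-stage state of only the failed node's sub-tasks.

def recover(checkpoint, failed_node):
    """Return {subtask_id: state} for the sub-tasks the failed node computed."""
    return {sid: state
            for sid, (node, state) in checkpoint.items()
            if node == failed_node}

checkpoint = {
    "sub-1": ("node-a", {"step": 7, "partial_sum": 42}),
    "sub-2": ("node-b", {"step": 7, "partial_sum": 13}),
    "sub-3": ("node-a", {"step": 7, "partial_sum": 5}),
}
to_restore = recover(checkpoint, "node-a")
# sub-1 and sub-3 are resumed on a replacement node; sub-2 is untouched
```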
As a comparative example, in the prior art, when a node fails, the stored state data of the computing task is used to restart the computing task from the state of the previous stage; that is, the prior art must recover the sub-computing tasks in all nodes from the state of the previous stage when a single node fails.
Under the prior-art scheme, the recovery process is slow, since all sub-computation tasks must recover from the state of the last stage. Moreover, because the sub-computation tasks are associated with one another and their states are tightly coupled, the entire computation can be correctly recovered only after the associations among all the sub-computations have been confirmed. Furthermore, since distributed computation itself uses multiple nodes to perform one computing task, recomputing the whole task whenever any node fails effectively multiplies the failure probability.
According to the distributed computing method provided by the embodiment of the invention, the state data of the sub-computing tasks computed by each node in the stored state data are respectively marked with the corresponding node information, so that when a fault node occurs, the sub-computing tasks computed by the fault node only need to be recalculated according to the state data of the previous stage, and the sub-computing tasks in other nodes do not need to be recalculated, so that the computation amount can be reduced and the time required for fault recovery can be shortened compared with the prior art.
In this embodiment of the present invention, optionally, step S104 includes: when a fault node occurs, extracting the state data of the last stage of the sub-computing task computed by the fault node according to the node information marked on the state data; and suspending the sub-computation tasks in the other nodes until the sub-computation task computed by the failed node has been recovered in the normal node according to the state data of the last stage of the sub-computation task computed by the failed node.
In the distributed computing process, one sub-computation may depend on other sub-computations. Therefore, in the present embodiment, when a node fails, the sub-computation tasks in the other nodes are suspended, and once the sub-computation tasks originally computed by the failed node have been recovered in a normal node from the last-stage state data, the sub-computation tasks in the other nodes are restarted.
When a fault node occurs and the sub-computation tasks in other nodes are suspended, all data generated by the computation of each node are stored in the current node, and when the sub-computation tasks originally computed by the fault node are recovered in the normal node and the computation of the sub-computation tasks in other nodes is restarted, the data generated by the sub-computation tasks in the computation process are distributed among the nodes.
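The suspend-and-buffer behaviour can be sketched roughly as follows (class and method names are assumptions, not from the patent): while suspended, a node keeps the data it produces locally; on resume, the buffered data is flushed and subsequent data is again distributed among the nodes.

```python
# Illustrative per-node output buffer used during a suspension.

class WorkerBuffer:
    def __init__(self):
        self.suspended = False
        self.local = []      # data kept in the current node while suspended

    def emit(self, record, send):
        if self.suspended:
            self.local.append(record)   # store in the current node
        else:
            send(record)                # distribute among the nodes

    def resume(self, send):
        self.suspended = False
        for record in self.local:       # flush buffered data downstream
            send(record)
        self.local.clear()

sent = []
w = WorkerBuffer()
w.suspended = True
w.emit("r1", sent.append)   # buffered locally during the suspension
w.resume(sent.append)       # recovery done: flush the buffer
w.emit("r2", sent.append)   # normal distribution resumes
# sent is now ['r1', 'r2']
```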
It should be noted that, although the above embodiment describes a scheme of suspending the sub-computation tasks of the other nodes when a node fails, the present invention need not suspend all the sub-computation tasks in the normal nodes. For example, the present invention may employ the following modification.
According to a modification of the present invention, in step S103, the state data of the computation task is stored in stages, and the state data of the sub-computation task computed by each node in the stored state data is labeled not only with the corresponding node information but also with sub-computation task information.
That is, in this modification, the state data of each sub-computation task in the stored state data is labeled with the corresponding node information, recording which node computed it, and additionally with sub-computation task information, recording which sub-computation it belongs to.
In a distributed computing process, one sub-computing task may be closely related to other sub-computing tasks, e.g., one sub-computing task may need to obtain computing result data of another sub-computing task to start computing.
However, not all sub-computation tasks are related to one another. According to the distributed computing method of this modification, node information and sub-computation task information are marked on the stored state data, so when a node fails, it is only necessary to determine from that marked information which sub-computation tasks were being computed on the failed node, and thereby whether sub-computation tasks related to the affected ones exist on other normal nodes.
According to this modification, only the sub-computing tasks related to those affected by the failed node need to be screened out, and only the sub-computing tasks in the failed node and in the nodes hosting related sub-computing tasks are suspended; the sub-computing tasks in all other nodes need not be suspended, which saves computing time. When the sub-computing tasks originally computed by the failed node have been recovered in a normal node from the last-stage state data, the sub-computing tasks originally computed by the failed node and those in the nodes hosting related sub-computing tasks are restarted.
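A possible way to implement the screening step, under the assumption that relatedness between sub-tasks is known as a set of pairs (this representation is illustrative; the patent does not specify how relatedness is recorded):

```python
# Suspend only nodes hosting sub-tasks related to those on the failed node.

def nodes_to_suspend(assignment, deps, failed_node):
    """assignment: node -> set of sub-task ids.
    deps: set of frozenset pairs of related sub-tasks.
    Returns the nodes (besides the failed one) hosting a sub-task related
    to any sub-task that was running on the failed node."""
    affected = assignment[failed_node]
    related = {s for pair in deps if pair & affected for s in pair} - affected
    return {n for n, subs in assignment.items()
            if n != failed_node and subs & related}

assignment = {"node-a": {"s1"}, "node-b": {"s2"}, "node-c": {"s3"}}
deps = {frozenset({"s1", "s2"})}   # s3 is independent of s1 and s2
suspended = nodes_to_suspend(assignment, deps, "node-a")
# only node-b is suspended; node-c keeps computing
```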
In the present modification, when suspending the sub-computation tasks in the failed node and the node in which the related sub-computation task exists, the data generated by the computation of each node may be stored in the current node, and the data of the other nodes that normally compute may be distributed normally. When the sub-computation tasks originally computed by the failed node are recovered in the normal nodes according to the state data of the last stage of the sub-computation tasks computed by the failed node, the sub-computation tasks originally computed by the failed node and the sub-computation tasks computed by the nodes with the related sub-computation tasks are restarted, and data generated by computation are distributed among the nodes.
Optionally, in the embodiment of the present invention, each node periodically sends its state data to a management node, and when the management node fails to receive state data from a node, it determines that the node has failed.
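A minimal heartbeat sketch of this optional fault-detection scheme: each node reports its state data periodically, and the management node declares a node failed if no report has arrived within a timeout. The class name, the timeout value, and the clock handling are assumptions for illustration.

```python
import time

class ManagementNode:
    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_seen = {}  # node id -> time of last state-data report

    def receive_state(self, node_id, state_data, now=None):
        # Record that this node reported; state_data would be checkpointed here.
        self.last_seen[node_id] = now if now is not None else time.monotonic()

    def failed_nodes(self, now=None):
        # Nodes whose last report is older than the timeout are judged failed.
        now = now if now is not None else time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]
```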
Fig. 2 shows a schematic diagram of the main modules of an apparatus 200 for distributed computing according to an embodiment of the invention. As shown in Fig. 2, the apparatus 200 includes:
a task acquisition module 201, configured to acquire computing task data to be computed;

a task allocation module 202, configured to split the computing task data acquired by the task acquisition module 201 into multiple sub-computing task data, allocate the multiple sub-computing task data to multiple nodes, and compute multiple sub-computing tasks through the multiple nodes;

a state storage module 203, configured to store state data of the computing task in stages and to mark the state data of the sub-computing tasks computed by each node with corresponding node information; and

a failure recovery module 204, configured to, when a node fails, extract the state data of the previous stage of the sub-computation tasks computed by the failed node according to the node information marked on the state data, and recover those sub-computation tasks on a normal node from that previous-stage state data.
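The four modules of apparatus 200 can be sketched as plain classes, one per responsibility. Only the module numbers come from the description above; every other name, the record layout, and the round-robin split are illustrative assumptions.

```python
class TaskAcquisitionModule:                      # module 201
    def acquire(self, source):
        return source.read()                      # task data to be computed

class TaskAllocationModule:                       # module 202
    def split_and_assign(self, task_data, nodes):
        # Round-robin split into one sub-task data chunk per node (assumed)
        chunks = [task_data[i::len(nodes)] for i in range(len(nodes))]
        return dict(zip(nodes, chunks))

class StateStorageModule:                         # module 203
    def __init__(self):
        self.checkpoints = []
    def store(self, stage, node, task, state):
        # Mark each stored state record with the node (and sub-task) info
        self.checkpoints.append(
            {"stage": stage, "node": node, "task": task, "state": state})

class FailureRecoveryModule:                      # module 204
    def recover(self, storage, failed_node):
        # Extract, per sub-task, the latest-stage state the failed node stored;
        # these records are what a normal node would resume from.
        records = [c for c in storage.checkpoints if c["node"] == failed_node]
        latest = {}
        for c in records:
            if c["task"] not in latest or c["stage"] > latest[c["task"]]["stage"]:
                latest[c["task"]] = c
        return list(latest.values())
```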
FIG. 3 illustrates an exemplary system architecture 300 of a distributed computing method or apparatus to which embodiments of the invention may be applied.
As shown in fig. 3, the system architecture 300 may include terminal devices 301, 302, 303, a network 304, and a server 305. The network 304 serves as a medium for providing communication links between the terminal devices 301, 302, 303 and the server 305. Network 304 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 301, 302, 303 to interact with the server 305 via the network 304 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 301, 302, 303, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software (by way of example only).
The terminal devices 301, 302, 303 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 305 may be a server providing various services, such as a background management server (for example only) providing support for shopping-like websites browsed by users using the terminal devices 301, 302, 303. The backend management server may analyze and perform other processing on the received data such as the product information query request, and feed back a processing result (for example, target push information, product information — just an example) to the terminal device.
It should be noted that the distributed computing method provided by the embodiment of the present invention is generally executed by the server 305, and accordingly, the distributed computing apparatus is generally disposed in the server 305.
It should be understood that the number of terminal devices, networks, and servers in fig. 3 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 4, a block diagram of a computer system 400 suitable for use with a terminal device implementing an embodiment of the invention is shown. The terminal device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 4, the computer system 400 includes a Central Processing Unit (CPU)401 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the system 400 are also stored. The CPU 401, ROM 402, and RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display device such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card, a modem, or the like. The communication section 409 performs communication processing via a network such as the internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 410 as necessary, so that a computer program read out therefrom is installed into the storage section 408 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411. The computer program performs the above-described functions defined in the system of the present invention when executed by a Central Processing Unit (CPU) 401.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a task acquisition module, a task allocation module, a state storage module, and a failure recovery module. The names of these modules do not in some cases form a limitation on the modules themselves, and for example, the task assignment module may also be described as a "module that assigns computing tasks to nodes".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform:
acquiring computing task data to be computed;
splitting the computing task data into multiple sub-computing task data, distributing the multiple sub-computing task data to multiple nodes, and computing the multiple sub-computing tasks through the multiple nodes;
storing the state data of the computing task in stages, and marking the state data of the sub-computing tasks computed by each node in the stored state data with corresponding node information respectively; and
when a failed node occurs, extracting state data of a previous stage of the sub-computation task computed by the failed node according to the node information marked for the state data, and restoring the sub-computation task computed by the failed node in the normal node according to the state data of the previous stage of the sub-computation task computed by the failed node.
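The four steps above can be sketched end to end: split the task data across nodes, compute in stages with a per-node, per-stage checkpoint, and on failure re-run only the failed node's sub-task from its previous-stage checkpoint (standing in for recovery on a normal node). The staging model and all names are illustrative assumptions; the point is that the recovered run produces the same result without recomputing other nodes' work.

```python
def run(task_data, nodes, stages, fail=None):
    # Round-robin split of task data into one sub-task per node (assumed)
    subtasks = {n: task_data[i::len(nodes)] for i, n in enumerate(nodes)}
    ckpt = {}                     # (node, stage) -> state, marked by node
    state = {n: 0 for n in nodes}
    for stage in range(stages):
        for n in nodes:
            if fail is not None and (n, stage) == fail:
                # Failure at this stage: restore from the previous-stage
                # checkpoint and recompute only this node's sub-task.
                prev = ckpt[(n, stage - 1)] if stage > 0 else 0
                state[n] = prev + sum(subtasks[n])
            else:
                state[n] += sum(subtasks[n])
            ckpt[(n, stage)] = state[n]  # store state data in stages
    return sum(state.values())
```

A failure injected at any stage leaves the final result unchanged, which is the effect the recovery scheme is after.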
According to the technical scheme of the embodiment of the invention, because the state data of the sub-computation tasks computed by each node are marked with corresponding node information, when a node fails only the sub-computation tasks computed by the failed node need to be recomputed from the state data of the previous stage; the sub-computation tasks on other nodes need not be recomputed. The time and computation required for failure recovery can therefore be reduced compared with the prior art.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A method of distributed computing, comprising:
acquiring computing task data to be computed;
splitting the computing task data into multiple sub-computing task data, distributing the multiple sub-computing task data to multiple nodes, and computing multiple sub-computing tasks through the multiple nodes;
storing state data of the computing tasks in stages, and marking the state data of the sub-computing tasks computed by each node in the stored state data with corresponding node information respectively; and
when a failed node occurs, extracting state data of a previous stage of a sub-computation task computed by the failed node according to the node information marked on the state data, and recovering the sub-computation task computed by the failed node in a normal node according to the state data of the previous stage of the sub-computation task computed by the failed node.
2. The method of distributed computing according to claim 1,
the step of, when a failed node occurs, extracting state data of a previous stage of a sub-computation task computed by the failed node based on the node information marked for the state data, and restoring, in a normal node, the sub-computation task computed by the failed node based on the state data of the previous stage of the sub-computation task computed by the failed node, includes:
when a fault node occurs, extracting state data of a previous stage of a sub-computation task computed by the fault node according to the node information marked by the state data; and
suspending the sub-computation tasks computed by the other nodes until the sub-computation tasks computed by the failed node have been recovered in the normal node from the state data of the previous stage of the sub-computation tasks computed by the failed node.
3. The method of distributed computing according to claim 2,
the step of suspending the sub-computation tasks computed by the other nodes until the sub-computation task computed by the failed node has been recovered in the normal node from the state data of the last stage of the sub-computation task computed by the failed node comprises:
suspending the sub-computation tasks computed by other nodes, and respectively storing all data generated by the computation of each node in the current node; and
when the sub-computation tasks computed by the failed node have been restored in the normal node, the sub-computation tasks computed by the other nodes are started, and data resulting from the computation is distributed among the nodes.
4. The method of distributed computing according to claim 1,
the step of storing the state data of the computation task in stages and marking the state data of the sub-computation tasks computed by each node in the stored state data with corresponding node information respectively comprises the following steps:
and storing the state data of the computing task in stages, and marking the state data of the sub-computing task computed by each node in the stored state data with sub-computing task information respectively.
5. The method of distributed computing according to claim 4,
the step of, when a failed node occurs, extracting state data of a previous stage of a sub-computation task computed by the failed node based on the node information marked for the state data, and restoring, in a normal node, the sub-computation task computed by the failed node based on the state data of the previous stage of the sub-computation task computed by the failed node, includes:
when a fault node occurs, judging whether a sub-computation task related to the sub-computation task influenced by the fault node exists in other nodes or not according to node information and sub-computation task information respectively marked on state data of the sub-computation task computed by the fault node, and screening out nodes with the related sub-computation tasks; and
suspending the sub-computation tasks in the failed node and the node in which the related sub-computation task exists until the sub-computation task computed by the failed node has been recovered in the normal node according to the state data of the last stage of the sub-computation task computed by the failed node.
6. The method of distributed computing according to claim 5,
the step of suspending the sub-computation tasks in the failed node and the node in which the related sub-computation task exists until the sub-computation task computed by the failed node has been recovered in the normal node from the state data of the last stage of the sub-computation task computed by the failed node includes:
suspending the sub-computing tasks in the fault node and the node with the related sub-computing tasks, and respectively storing data generated by computing each node in the current node; and
when the sub-computation task computed by the failed node has been restored in the normal node based on the state data of the last stage of the sub-computation task computed by the failed node, the sub-computation task computed by the failed node and the sub-computation task computed by the node where the related sub-computation task exists are started, and data resulting from the computation is distributed among the nodes.
7. The method of distributed computing according to any of claims 1-6,
the node periodically sends state data to the management node, and when the management node does not receive the state data of the node, the node is judged to have a fault.
8. An apparatus for distributed computing, comprising:
the task acquisition module acquires computing task data to be computed;
the task allocation module divides the computing task data acquired by the task acquisition module into a plurality of sub-computing task data, allocates the plurality of sub-computing task data to a plurality of nodes, and calculates a plurality of sub-computing tasks through the plurality of nodes;
the state storage module stores the states of the computing tasks in stages and marks corresponding node information on the state data of the sub-computing tasks computed by each node in the stored state data respectively; and
a fault recovery module that extracts, when a faulty node occurs, state data of a previous stage of a sub-computation task calculated by the faulty node according to the node information marked for the state data, and recovers, in a normal node, the sub-computation task calculated by the faulty node according to the state data of the previous stage of the sub-computation task calculated by the faulty node.
9. An electronic device for distributed computing, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110144149.7A CN113760469A (en) | 2021-02-02 | 2021-02-02 | Distributed computing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113760469A true CN113760469A (en) | 2021-12-07 |
Family
ID=78786591
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114564305A (en) * | 2022-02-18 | 2022-05-31 | 苏州浪潮智能科技有限公司 | Control method, device and equipment for distributed inference and readable storage medium |
CN117075802A (en) * | 2023-08-04 | 2023-11-17 | 上海虹港数据信息有限公司 | Distributed storage method based on artificial intelligence power |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105573824A (en) * | 2014-10-10 | 2016-05-11 | 腾讯科技(深圳)有限公司 | Monitoring method and system of distributed computing system |
CN107015853A (en) * | 2016-10-10 | 2017-08-04 | 阿里巴巴集团控股有限公司 | The implementation method and device of phased mission system |
US20180285211A1 (en) * | 2017-04-04 | 2018-10-04 | International Business Machines Corporation | Recovering a failed clustered system using configuration data fragments |
CN110018926A (en) * | 2018-11-22 | 2019-07-16 | 阿里巴巴集团控股有限公司 | Fault recovery method, device, electronic equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |