CN113760469A - Distributed computing method and device - Google Patents
- Publication number
- CN113760469A (application CN202110144149.7A)
- Authority
- CN
- China
- Prior art keywords
- sub
- node
- computation
- computed
- state data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/465—Distributed object oriented systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5072—Grid computing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
- G06F9/5088—Techniques for rebalancing the load in a distributed system involving task migration
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Hardware Redundancy (AREA)
Abstract
The invention discloses a distributed computing method and device, and relates to the field of computer technology. One embodiment of the method comprises: acquiring computing task data to be computed; splitting the computing task data into multiple sub-computing task data, distributing the sub-computing task data to multiple nodes, and computing the multiple sub-computing tasks through those nodes; storing the state data of the computing task in stages, and marking the state data of the sub-computing tasks computed by each node in the stored state data with corresponding node information; and, when a node fails, extracting the last-stage state data of the sub-computing tasks computed by the failed node according to the marked node information, and recovering those sub-computing tasks in a normal node from that last-stage state data. Compared with the prior art, this embodiment reduces the time and amount of computation required for failure recovery.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a distributed computing method and apparatus.
Background
In recent years, with the development of technology, distributed computing has been adopted to meet high-performance computing scenarios. Compared with traditional centralized computing, distributed computing divides a computing task into multiple sub-computing tasks, uses multiple machines to compute those sub-computing tasks, and combines the results obtained when the sub-computing tasks finish into the final result.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
in such distributed computing methods, when a problem occurs in any one of the sub-computing tasks during computation (for example, the machine hosting that sub-computing task goes down), the whole computing task is restarted. That is, when any sub-computation task has a problem, all sub-computations restart from a certain historical state.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for distributed computing, which can reduce the time and the amount of computation required for fault recovery.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a distributed computing method including:
acquiring computing task data to be computed;
splitting the computing task data into multiple sub-computing task data, distributing the multiple sub-computing task data to multiple nodes, and computing multiple sub-computing tasks through the multiple nodes;
storing state data of the computing tasks in stages, and marking the state data of the sub-computing tasks computed by each node in the stored state data with corresponding node information respectively; and
when a failed node occurs, extracting state data of a previous stage of a sub-computation task computed by the failed node according to the node information marked on the state data, and recovering the sub-computation task computed by the failed node in a normal node according to the state data of the previous stage of the sub-computation task computed by the failed node.
Optionally, in the above-described distributed computing method:
the step of, when a failed node occurs, extracting state data of a previous stage of a sub-computation task computed by the failed node based on the node information marked for the state data, and restoring, in a normal node, the sub-computation task computed by the failed node based on the state data of the previous stage of the sub-computation task computed by the failed node, includes:
when a fault node occurs, extracting state data of a previous stage of a sub-computation task computed by the fault node according to the node information marked by the state data; and
suspending the sub-computation tasks computed by the other nodes until the sub-computation tasks computed by the failed node have been recovered in the normal node from the state data of the previous stage of the sub-computation tasks computed by the failed node.
Optionally, in the above-described distributed computing method:
the step of suspending the sub-computation tasks computed by the other nodes until the sub-computation task computed by the failed node has been recovered in the normal node from the state data of the last stage of the sub-computation task computed by the failed node comprises:
suspending the sub-computation tasks computed by other nodes, and respectively storing all data generated by the computation of each node in the current node; and
when the sub-computation tasks computed by the failed node have been restored in the normal node, the sub-computation tasks computed by the other nodes are started, and data resulting from the computation is distributed among the nodes.
Optionally, in the above-described distributed computing method:
the step of storing the state data of the computation task in stages and marking the state data of the sub-computation tasks computed by each node in the stored state data with corresponding node information respectively comprises the following steps:
and storing the state data of the computing task in stages, and marking the state data of the sub-computing task computed by each node in the stored state data with sub-computing task information respectively.
Optionally, in the above-described distributed computing method:
the step of, when a failed node occurs, extracting state data of a previous stage of a sub-computation task computed by the failed node based on the node information marked for the state data, and restoring, in a normal node, the sub-computation task computed by the failed node based on the state data of the previous stage of the sub-computation task computed by the failed node, includes:
when a fault node occurs, judging whether a sub-computation task related to the sub-computation task influenced by the fault node exists in other nodes or not according to node information and sub-computation task information respectively marked on state data of the sub-computation task computed by the fault node, and screening out nodes with the related sub-computation tasks; and
suspending the sub-computation tasks in the failed node and the node in which the related sub-computation task exists until the sub-computation task computed by the failed node has been recovered in the normal node according to the state data of the last stage of the sub-computation task computed by the failed node.
Optionally, in the above-described distributed computing method:
the step of suspending the sub-computation tasks in the failed node and the node in which the related sub-computation task exists until the sub-computation task computed by the failed node has been recovered in the normal node from the state data of the last stage of the sub-computation task computed by the failed node includes:
suspending the sub-computing tasks in the fault node and the node with the related sub-computing tasks, and respectively storing data generated by computing each node in the current node; and
when the sub-computation task computed by the failed node has been restored in the normal node based on the state data of the last stage of the sub-computation task computed by the failed node, the sub-computation task computed by the failed node and the sub-computation task computed by the node where the related sub-computation task exists are started, and data resulting from the computation is distributed among the nodes.
Optionally, in the above-described distributed computing method:
the node periodically sends state data to the management node, and when the management node does not receive the state data of the node, the node is judged to have a fault.
According to another aspect of the embodiments of the present invention, there is provided an apparatus for distributed computing, including:
the task acquisition module acquires computing task data to be computed;
the task allocation module divides the computing task data acquired by the task acquisition module into a plurality of sub-computing task data, allocates the plurality of sub-computing task data to a plurality of nodes, and calculates a plurality of sub-computing tasks through the plurality of nodes;
the state storage module stores the states of the computing tasks in stages and marks corresponding node information on the state data of the sub-computing tasks computed by each node in the stored state data respectively; and
a fault recovery module that extracts, when a faulty node occurs, state data of a previous stage of a sub-computation task calculated by the faulty node according to the node information marked for the state data, and recovers, in a normal node, the sub-computation task calculated by the faulty node according to the state data of the previous stage of the sub-computation task calculated by the faulty node.
Optionally, in the above-described distributed computing apparatus:
when a fault node occurs, the fault recovery module extracts state data of the last stage of the sub-computing task computed by the fault node according to the node information marked by the state data; and
suspending the sub-computation tasks computed by the other nodes until the sub-computation tasks computed by the failed node have been recovered in the normal node from the state data of the previous stage of the sub-computation tasks computed by the failed node.
Optionally, in the above-described distributed computing apparatus:
the fault recovery module suspends the sub-computing tasks computed by other nodes and stores all data generated by computing of each node in the current node respectively; and
when the sub-computation tasks computed by the failed node have been restored in the normal node, the sub-computation tasks computed by the other nodes are started, and data resulting from the computation is distributed among the nodes.
Optionally, in the above-described distributed computing apparatus:
the state storage module stores the state data of the computing task in stages, and marks the state data of the sub-computing task computed by each node in the stored state data with sub-computing task information respectively.
Optionally, in the above-described distributed computing apparatus:
when a fault node occurs, the fault recovery module judges whether a sub-computation task related to the sub-computation task affected by the fault node exists in other nodes according to node information and sub-computation task information respectively marked on state data of the sub-computation task computed by the fault node, and screens out nodes with the related sub-computation tasks; and
suspending the sub-computation tasks in the failed node and the node in which the related sub-computation task exists until the sub-computation task computed by the failed node has been recovered in the normal node according to the state data of the last stage of the sub-computation task computed by the failed node.
Optionally, in the above-described distributed computing apparatus:
the fault recovery module suspends the sub-computing tasks in the fault node and the node with the related sub-computing tasks, and respectively stores data generated by computing each node in the current node; and
when the sub-computation task computed by the failed node has been restored in the normal node based on the state data of the last stage of the sub-computation task computed by the failed node, the sub-computation task computed by the failed node and the sub-computation task computed by the node where the related sub-computation task exists are started, and data resulting from the computation is distributed among the nodes.
Optionally, in the above-described distributed computing apparatus:
the node periodically sends state data to the management node, and when the management node does not receive the state data of the node, the node is judged to have a fault.
There is also provided, in accordance with an embodiment of the present invention, an electronic device for distributed computing, including:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement a method as in any one of the above.
There is also provided, in accordance with an embodiment of the present invention, a computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method as set forth in any one of the above.
One embodiment of the above invention has the following advantage: because the state data of the sub-computing tasks computed by each node in the stored state data is marked with corresponding node information, when a node fails, only the sub-computing tasks computed by the failed node need to be recomputed from the state data of the previous stage. This overcomes the prior-art problem that the whole computing task must be recovered from a previous state when a single node fails, and achieves the technical effect of reducing the time and amount of computation required for failure recovery.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of a main flow of a method of distributed computing according to an embodiment of the invention;
FIG. 2 is a schematic diagram of the main modules of an apparatus for distributed computing according to an embodiment of the present invention;
FIG. 3 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 4 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of a main flow of a distributed computing method according to an embodiment of the present invention, and as shown in fig. 1, the distributed computing method according to the embodiment of the present invention includes:
step S101, computing task data to be computed is obtained.
In the embodiment of the present invention, the computing task data to be computed is, for example, data of a computing task that has a large data volume and requires strong computing power. Of course, a computing task with an ordinary data volume can also be computed by the distributed computing method of the embodiment of the present invention, thereby shortening the time required for the computation.
Step S102, splitting the calculation task data obtained in step S101 into a plurality of sub-calculation task data, distributing the plurality of sub-calculation task data to a plurality of nodes, and calculating the plurality of sub-calculation tasks by the plurality of nodes.
In the embodiment of the invention, a computing task with a large data volume that requires strong computing power is divided into multiple sub-computing tasks; the data of the divided sub-computing tasks are distributed to multiple nodes, and the sub-computing tasks are computed by those nodes, which reduces the computing-power requirement on each individual node. In the embodiment of the present invention, a node is, for example, a computer, and one or more sub-computation tasks may be allocated to one node.
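As a concrete illustration of this splitting and distribution step, the following Python sketch partitions task data into sub-task chunks and assigns them to nodes round-robin, so that one node may receive more than one sub-task. All function and node names here are hypothetical illustrations, not taken from the patent.

```python
# Hypothetical sketch of step S102: split computing task data into
# sub-computing task data and distribute it to nodes.

def split_task(task_data, num_subtasks):
    """Split the task data into roughly equal sub-computation chunks."""
    chunk = (len(task_data) + num_subtasks - 1) // num_subtasks
    return [task_data[i:i + chunk] for i in range(0, len(task_data), chunk)]

def assign_to_nodes(subtasks, node_ids):
    """Round-robin assignment; one node may receive several sub-tasks."""
    assignment = {nid: [] for nid in node_ids}
    for i, sub in enumerate(subtasks):
        assignment[node_ids[i % len(node_ids)]].append(sub)
    return assignment

subtasks = split_task(list(range(10)), 4)
assignment = assign_to_nodes(subtasks, ["node-a", "node-b"])
# node-a holds sub-tasks 0 and 2; node-b holds sub-tasks 1 and 3
```

In a real system the split would follow data locality and load, but the shape of the mapping from nodes to assigned sub-tasks is the same.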
Step S103 stores the state data of the computation task in stages, and marks the state data of the sub-computation task computed by each node in the stored state data with corresponding node information, respectively.
In the embodiment of the present invention, specifically, for example, barrier (barrier) information is inserted into the sub-computation task data acquired by the node, and when the sub-computation task computes the barrier information, the state data of the sub-computation task is stored, for example, in the management node.
The barrier information carries, for example, the same sequence number for all the sub-computation tasks; when every sub-computation task has computed up to the barrier information with the same sequence number, and the state data of all sub-computation tasks has accordingly been stored in the management node, the state data of one stage of the computation task has been stored. The state data refers to the computed data, configuration information, metadata, and the like of all current computing nodes, and it is stored externally, for example in a management node. In this specification, the state data of a sub-computation task means the state data of one sub-computation.
As each node continues computing its sub-computation tasks, once all sub-computation tasks have reached the barrier information with the next sequence number, the state data of all sub-computation tasks for that next stage is stored in the management node. When the state data of all sub-computing tasks for the next stage has been stored, the state data stored for the previous stage is deleted. This step may also be understood as updating the state data as the computation of the computing task proceeds.
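The staged snapshot bookkeeping described above can be sketched as follows. `CheckpointStore`, its method names, and the shape of the state records are illustrative assumptions: a stage is complete when every sub-task has reported state for the same barrier sequence number, at which point the previously completed stage is discarded.

```python
# Illustrative checkpoint store kept on the management node (names assumed).

class CheckpointStore:
    def __init__(self, subtask_ids):
        self.subtask_ids = set(subtask_ids)
        self.pending = {}          # barrier seq -> {subtask_id: (node, state)}
        self.completed_stage = None

    def report(self, barrier_seq, subtask_id, node_id, state):
        stage = self.pending.setdefault(barrier_seq, {})
        # the sub-task's state data is tagged with the node that computed it
        stage[subtask_id] = (node_id, state)
        if set(stage) == self.subtask_ids:
            # stage complete: keep it, dropping the previously kept stage
            self.completed_stage = (barrier_seq, self.pending.pop(barrier_seq))

    def state_for_node(self, node_id):
        """Extract last-stage state of the sub-tasks a given node computed."""
        _seq, stage = self.completed_stage
        return {sid: st for sid, (nid, st) in stage.items() if nid == node_id}
```

Keeping only the most recently completed stage mirrors the deletion of the previous stage's state data described in the text.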
According to the distributed computing method provided by the embodiment of the invention, the state data of the sub-computing tasks computed by each node in the stored state data are respectively marked with corresponding node information. That is, the state of the sub-computation task in the stored state data is marked as to which node it was computed by, whereby the state data of the previous stage of the sub-computation task computed by the node can be called for recovery when the node has a problem.
And step S104, when a fault node occurs, extracting the state data of the previous stage of the sub-computation task computed by the fault node according to the node information marked on the state data, and recovering the sub-computation task computed by the fault node in the normal node according to the state data of the previous stage of the sub-computation task computed by the fault node.
For example, a node (or nodes) may fail due to some problem (machine downtime, power outage, etc.) and may not be able to continue computing. At this time, since the state data of the sub-computation task calculated by each node in the stored state data is respectively marked with the corresponding node information when the state data of the computation task is stored in step S103, it is possible to extract the state data of the previous stage of the sub-computation task calculated by the failed node from the stored state data of the computation task, and to recover the sub-computation task in the failed node from the state data of the previous stage of the sub-computation task calculated by the failed node in the normal node. The management node, for example, newly creates a node to recover the sub-computation task in the failed node based on the state data of the previous stage of the sub-computation task computed by the failed node.
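A minimal sketch of the extraction step, assuming the stored stage is a mapping from sub-task id to a `(node, state)` pair: only entries tagged with the failed node's id are selected for recovery, while sub-tasks on healthy nodes continue untouched. All names and the state fields below are hypothetical.

```python
# Extract the last-stage state of only the failed node's sub-tasks.

def recover(checkpoint, failed_node):
    """Return {subtask_id: state} for the sub-tasks the failed node computed."""
    return {sid: state
            for sid, (node, state) in checkpoint.items()
            if node == failed_node}

checkpoint = {
    "sub-1": ("node-a", {"step": 7, "partial_sum": 42}),
    "sub-2": ("node-b", {"step": 7, "partial_sum": 13}),
    "sub-3": ("node-a", {"step": 7, "partial_sum": 5}),
}
to_restore = recover(checkpoint, "node-a")
# sub-1 and sub-3 are resumed on a replacement node; sub-2 is untouched
```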
As a comparative example, in the prior art, when a node fails, the stored state data of the computing task is used to restart the computing task from the state of the previous stage; that is, the prior art must recover the sub-computing tasks in all nodes from the state of the previous stage when a single node fails.
Under the prior-art scheme, the recovery process is slow, since all sub-computation tasks must recover from the state of the last stage. Moreover, because the sub-computation tasks are associated with one another and their states are tightly coupled, the entire computation can be correctly recovered only after the associations among all the sub-computations have been confirmed. Furthermore, since distributed computation itself uses multiple nodes to perform one computing task, recomputing the whole task whenever any node fails effectively multiplies the failure probability.
According to the distributed computing method provided by the embodiment of the invention, the state data of the sub-computing tasks computed by each node in the stored state data are respectively marked with the corresponding node information, so that when a fault node occurs, the sub-computing tasks computed by the fault node only need to be recalculated according to the state data of the previous stage, and the sub-computing tasks in other nodes do not need to be recalculated, so that the computation amount can be reduced and the time required for fault recovery can be shortened compared with the prior art.
In this embodiment of the present invention, optionally, step S104 includes: when a fault node occurs, extracting the state data of the last stage of the sub-computing task computed by the fault node according to the node information marked on the state data; and suspending the sub-computation tasks in the other nodes until the sub-computation task computed by the failed node has been recovered in the normal node according to the state data of the last stage of the sub-computation task computed by the failed node.
In the distributed computing process, one sub-computation may depend on other sub-computations. Therefore, in the present embodiment, when a node fails, the sub-computation tasks in the other nodes are suspended, and once the sub-computation tasks originally computed by the failed node have been recovered in a normal node from the last-stage state data, the sub-computation tasks in the other nodes are restarted.
When a fault node occurs and the sub-computation tasks in other nodes are suspended, all data generated by the computation of each node are stored in the current node, and when the sub-computation tasks originally computed by the fault node are recovered in the normal node and the computation of the sub-computation tasks in other nodes is restarted, the data generated by the sub-computation tasks in the computation process are distributed among the nodes.
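The suspend-and-buffer behaviour can be sketched roughly as follows (class and method names are assumptions, not from the patent): while suspended, a node keeps the data it produces locally; on resume, the buffered data is flushed and subsequent data is again distributed among the nodes.

```python
# Illustrative per-node output buffer used during a suspension.

class WorkerBuffer:
    def __init__(self):
        self.suspended = False
        self.local = []      # data kept in the current node while suspended

    def emit(self, record, send):
        if self.suspended:
            self.local.append(record)   # store in the current node
        else:
            send(record)                # distribute among the nodes

    def resume(self, send):
        self.suspended = False
        for record in self.local:       # flush buffered data downstream
            send(record)
        self.local.clear()

sent = []
w = WorkerBuffer()
w.suspended = True
w.emit("r1", sent.append)   # buffered locally during the suspension
w.resume(sent.append)       # recovery done: flush the buffer
w.emit("r2", sent.append)   # normal distribution resumes
# sent is now ['r1', 'r2']
```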
It should be noted that, although the above embodiment describes a scheme of suspending the sub-computation tasks of the other nodes when a node fails, the present invention need not suspend all the sub-computation tasks in the normal nodes. For example, the present invention may employ the following modification.
According to a modification of the present invention, in step S103, the state data of the computation task is stored in stages, and the state data of the sub-computation task computed by each node in the stored state data is labeled not only with the corresponding node information but also with sub-computation task information.
That is, in this modification, the state data of each sub-computation task in the stored state data is labeled with the corresponding node information, recording which node computed it, and additionally with sub-computation task information, recording which sub-computation it belongs to.
In a distributed computing process, one sub-computing task may be closely related to other sub-computing tasks, e.g., one sub-computing task may need to obtain computing result data of another sub-computing task to start computing.
However, not all sub-computation tasks are related to one another. According to the distributed computing method of this modification, node information and sub-computation task information are marked on the stored state data, so when a node fails, it is only necessary to determine from that marked information which sub-computation tasks were being computed on the failed node, and thereby whether sub-computation tasks related to the affected ones exist on other normal nodes.
According to this modification, only the sub-computing tasks related to those affected by the failed node need to be screened out, and only the sub-computing tasks in the failed node and in the nodes hosting related sub-computing tasks are suspended; the sub-computing tasks in all other nodes need not be suspended, which saves computing time. When the sub-computing tasks originally computed by the failed node have been recovered in a normal node from the last-stage state data, the sub-computing tasks originally computed by the failed node and those in the nodes hosting related sub-computing tasks are restarted.
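A possible way to implement the screening step, under the assumption that relatedness between sub-tasks is known as a set of pairs (this representation is illustrative; the patent does not specify how relatedness is recorded):

```python
# Suspend only nodes hosting sub-tasks related to those on the failed node.

def nodes_to_suspend(assignment, deps, failed_node):
    """assignment: node -> set of sub-task ids.
    deps: set of frozenset pairs of related sub-tasks.
    Returns the nodes (besides the failed one) hosting a sub-task related
    to any sub-task that was running on the failed node."""
    affected = assignment[failed_node]
    related = {s for pair in deps if pair & affected for s in pair} - affected
    return {n for n, subs in assignment.items()
            if n != failed_node and subs & related}

assignment = {"node-a": {"s1"}, "node-b": {"s2"}, "node-c": {"s3"}}
deps = {frozenset({"s1", "s2"})}   # s3 is independent of s1 and s2
suspended = nodes_to_suspend(assignment, deps, "node-a")
# only node-b is suspended; node-c keeps computing
```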
In the present modification, when suspending the sub-computation tasks in the failed node and the node in which the related sub-computation task exists, the data generated by the computation of each node may be stored in the current node, and the data of the other nodes that normally compute may be distributed normally. When the sub-computation tasks originally computed by the failed node are recovered in the normal nodes according to the state data of the last stage of the sub-computation tasks computed by the failed node, the sub-computation tasks originally computed by the failed node and the sub-computation tasks computed by the nodes with the related sub-computation tasks are restarted, and data generated by computation are distributed among the nodes.
Optionally, in the embodiment of the present invention, each node periodically sends its state data to a management node, and when the management node fails to receive state data from a node, it determines that the node has failed.
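A minimal heartbeat sketch of this optional fault-detection scheme: each node reports its state data periodically, and the management node declares a node failed if no report has arrived within a timeout. The class name, the timeout value, and the clock handling are assumptions for illustration.

```python
import time

class ManagementNode:
    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_seen = {}  # node id -> time of last state-data report

    def receive_state(self, node_id, state_data, now=None):
        # Record that this node reported; state_data would be checkpointed here.
        self.last_seen[node_id] = now if now is not None else time.monotonic()

    def failed_nodes(self, now=None):
        # Nodes whose last report is older than the timeout are judged failed.
        now = now if now is not None else time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]
```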
Fig. 2 shows a schematic diagram of the main modules of an apparatus 200 for distributed computing according to an embodiment of the invention. As shown in Fig. 2, the apparatus 200 includes:
a task acquisition module 201, configured to acquire computing task data to be computed;

a task allocation module 202, configured to split the computing task data acquired by the task acquisition module 201 into multiple sub-computing task data, allocate the multiple sub-computing task data to multiple nodes, and compute multiple sub-computing tasks through the multiple nodes;

a state storage module 203, configured to store state data of the computing task in stages and to mark the state data of the sub-computing tasks computed by each node with corresponding node information; and

a failure recovery module 204, configured to, when a node fails, extract the state data of the previous stage of the sub-computation tasks computed by the failed node according to the node information marked on the state data, and recover those sub-computation tasks on a normal node from that previous-stage state data.
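The four modules of apparatus 200 can be sketched as plain classes, one per responsibility. Only the module numbers come from the description above; every other name, the record layout, and the round-robin split are illustrative assumptions.

```python
class TaskAcquisitionModule:                      # module 201
    def acquire(self, source):
        return source.read()                      # task data to be computed

class TaskAllocationModule:                       # module 202
    def split_and_assign(self, task_data, nodes):
        # Round-robin split into one sub-task data chunk per node (assumed)
        chunks = [task_data[i::len(nodes)] for i in range(len(nodes))]
        return dict(zip(nodes, chunks))

class StateStorageModule:                         # module 203
    def __init__(self):
        self.checkpoints = []
    def store(self, stage, node, task, state):
        # Mark each stored state record with the node (and sub-task) info
        self.checkpoints.append(
            {"stage": stage, "node": node, "task": task, "state": state})

class FailureRecoveryModule:                      # module 204
    def recover(self, storage, failed_node):
        # Extract, per sub-task, the latest-stage state the failed node stored;
        # these records are what a normal node would resume from.
        records = [c for c in storage.checkpoints if c["node"] == failed_node]
        latest = {}
        for c in records:
            if c["task"] not in latest or c["stage"] > latest[c["task"]]["stage"]:
                latest[c["task"]] = c
        return list(latest.values())
```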
FIG. 3 illustrates an exemplary system architecture 300 of a distributed computing method or apparatus to which embodiments of the invention may be applied.
As shown in fig. 3, the system architecture 300 may include terminal devices 301, 302, 303, a network 304, and a server 305. The network 304 serves as a medium for providing communication links between the terminal devices 301, 302, 303 and the server 305. Network 304 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 301, 302, 303 to interact with the server 305 via the network 304 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 301, 302, 303, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software (by way of example only).
The terminal devices 301, 302, 303 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 305 may be a server providing various services, such as a background management server (for example only) providing support for shopping-like websites browsed by users using the terminal devices 301, 302, 303. The backend management server may analyze and perform other processing on the received data such as the product information query request, and feed back a processing result (for example, target push information, product information — just an example) to the terminal device.
It should be noted that the distributed computing method provided by the embodiment of the present invention is generally executed by the server 305, and accordingly, the distributed computing apparatus is generally disposed in the server 305.
It should be understood that the number of terminal devices, networks, and servers in fig. 3 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 4, a block diagram of a computer system 400 suitable for use with a terminal device implementing an embodiment of the invention is shown. The terminal device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 4, the computer system 400 includes a Central Processing Unit (CPU)401 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the system 400 are also stored. The CPU 401, ROM 402, and RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display device such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card, a modem, or the like. The communication section 409 performs communication processing via a network such as the internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 410 as necessary, so that a computer program read out therefrom is installed into the storage section 408 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411. The computer program performs the above-described functions defined in the system of the present invention when executed by a Central Processing Unit (CPU) 401.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a task acquisition module, a task allocation module, a state storage module, and a failure recovery module. The names of these modules do not in some cases form a limitation on the modules themselves, and for example, the task assignment module may also be described as a "module that assigns computing tasks to nodes".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform:
acquiring computing task data to be computed;
splitting the computing task data into multiple sub-computing task data, distributing the multiple sub-computing task data to multiple nodes, and computing the multiple sub-computing tasks through the multiple nodes;
storing the state data of the computing task in stages, and marking the state data of the sub-computing tasks computed by each node in the stored state data with corresponding node information respectively; and
when a failed node occurs, extracting state data of a previous stage of the sub-computation task computed by the failed node according to the node information marked for the state data, and restoring the sub-computation task computed by the failed node in the normal node according to the state data of the previous stage of the sub-computation task computed by the failed node.
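The four steps above can be sketched end to end: split the task data across nodes, compute in stages with a per-node, per-stage checkpoint, and on failure re-run only the failed node's sub-task from its previous-stage checkpoint (standing in for recovery on a normal node). The staging model and all names are illustrative assumptions; the point is that the recovered run produces the same result without recomputing other nodes' work.

```python
def run(task_data, nodes, stages, fail=None):
    # Round-robin split of task data into one sub-task per node (assumed)
    subtasks = {n: task_data[i::len(nodes)] for i, n in enumerate(nodes)}
    ckpt = {}                     # (node, stage) -> state, marked by node
    state = {n: 0 for n in nodes}
    for stage in range(stages):
        for n in nodes:
            if fail is not None and (n, stage) == fail:
                # Failure at this stage: restore from the previous-stage
                # checkpoint and recompute only this node's sub-task.
                prev = ckpt[(n, stage - 1)] if stage > 0 else 0
                state[n] = prev + sum(subtasks[n])
            else:
                state[n] += sum(subtasks[n])
            ckpt[(n, stage)] = state[n]  # store state data in stages
    return sum(state.values())
```

A failure injected at any stage leaves the final result unchanged, which is the effect the recovery scheme is after.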
According to the technical scheme of the embodiment of the invention, because the state data of the sub-computation tasks computed by each node are marked with corresponding node information, when a node fails only the sub-computation tasks computed by the failed node need to be recomputed from the state data of the previous stage; the sub-computation tasks on other nodes need not be recomputed. The time and computation required for failure recovery can therefore be reduced compared with the prior art.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A method of distributed computing, comprising:
acquiring computing task data to be computed;
splitting the computing task data into multiple sub-computing task data, distributing the multiple sub-computing task data to multiple nodes, and computing multiple sub-computing tasks through the multiple nodes;
storing state data of the computing tasks in stages, and marking the state data of the sub-computing tasks computed by each node in the stored state data with corresponding node information respectively; and
when a failed node occurs, extracting state data of a previous stage of a sub-computation task computed by the failed node according to the node information marked on the state data, and recovering the sub-computation task computed by the failed node in a normal node according to the state data of the previous stage of the sub-computation task computed by the failed node.
2. The method of distributed computing according to claim 1,
the step of, when a failed node occurs, extracting state data of a previous stage of a sub-computation task computed by the failed node based on the node information marked for the state data, and restoring, in a normal node, the sub-computation task computed by the failed node based on the state data of the previous stage of the sub-computation task computed by the failed node, includes:
when a fault node occurs, extracting state data of a previous stage of a sub-computation task computed by the fault node according to the node information marked by the state data; and
suspending the sub-computation tasks computed by the other nodes until the sub-computation tasks computed by the failed node have been recovered in the normal node from the state data of the previous stage of the sub-computation tasks computed by the failed node.
3. The method of distributed computing according to claim 2,
the step of suspending the sub-computation tasks computed by the other nodes until the sub-computation task computed by the failed node has been recovered in the normal node from the state data of the last stage of the sub-computation task computed by the failed node comprises:
suspending the sub-computation tasks computed by other nodes, and respectively storing all data generated by the computation of each node in the current node; and
when the sub-computation tasks computed by the failed node have been restored in the normal node, the sub-computation tasks computed by the other nodes are started, and data resulting from the computation is distributed among the nodes.
4. The method of distributed computing according to claim 1,
the step of storing the state data of the computation task in stages and marking the state data of the sub-computation tasks computed by each node in the stored state data with corresponding node information respectively comprises the following steps:
and storing the state data of the computing task in stages, and marking the state data of the sub-computing task computed by each node in the stored state data with sub-computing task information respectively.
5. The method of distributed computing according to claim 4,
the step of, when a failed node occurs, extracting state data of a previous stage of a sub-computation task computed by the failed node based on the node information marked for the state data, and restoring, in a normal node, the sub-computation task computed by the failed node based on the state data of the previous stage of the sub-computation task computed by the failed node, includes:
when a fault node occurs, judging whether a sub-computation task related to the sub-computation task influenced by the fault node exists in other nodes or not according to node information and sub-computation task information respectively marked on state data of the sub-computation task computed by the fault node, and screening out nodes with the related sub-computation tasks; and
suspending the sub-computation tasks in the failed node and the node in which the related sub-computation task exists until the sub-computation task computed by the failed node has been recovered in the normal node according to the state data of the last stage of the sub-computation task computed by the failed node.
6. The method of distributed computing according to claim 5,
the step of suspending the sub-computation tasks in the failed node and the node in which the related sub-computation task exists until the sub-computation task computed by the failed node has been recovered in the normal node from the state data of the last stage of the sub-computation task computed by the failed node includes:
suspending the sub-computing tasks in the fault node and the node with the related sub-computing tasks, and respectively storing data generated by computing each node in the current node; and
when the sub-computation task computed by the failed node has been restored in the normal node based on the state data of the last stage of the sub-computation task computed by the failed node, the sub-computation task computed by the failed node and the sub-computation task computed by the node where the related sub-computation task exists are started, and data resulting from the computation is distributed among the nodes.
7. The method of distributed computing according to any of claims 1-6,
the node periodically sends state data to the management node, and when the management node does not receive the state data of the node, the node is judged to have a fault.
8. An apparatus for distributed computing, comprising:
the task acquisition module acquires computing task data to be computed;
the task allocation module divides the computing task data acquired by the task acquisition module into a plurality of sub-computing task data, allocates the plurality of sub-computing task data to a plurality of nodes, and calculates a plurality of sub-computing tasks through the plurality of nodes;
the state storage module stores the states of the computing tasks in stages and marks corresponding node information on the state data of the sub-computing tasks computed by each node in the stored state data respectively; and
a fault recovery module that extracts, when a faulty node occurs, state data of a previous stage of a sub-computation task calculated by the faulty node according to the node information marked for the state data, and recovers, in a normal node, the sub-computation task calculated by the faulty node according to the state data of the previous stage of the sub-computation task calculated by the faulty node.
9. An electronic device for distributed computing, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110144149.7A CN113760469A (en) | 2021-02-02 | 2021-02-02 | Distributed computing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113760469A true CN113760469A (en) | 2021-12-07 |
Family
ID=78786591
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114564305A (en) * | 2022-02-18 | 2022-05-31 | 苏州浪潮智能科技有限公司 | Control method, device and equipment for distributed inference and readable storage medium |
CN117075802A (en) * | 2023-08-04 | 2023-11-17 | 上海虹港数据信息有限公司 | Distributed storage method based on artificial intelligence power |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105573824A (en) * | 2014-10-10 | 2016-05-11 | 腾讯科技(深圳)有限公司 | Monitoring method and system of distributed computing system |
CN107015853A (en) * | 2016-10-10 | 2017-08-04 | 阿里巴巴集团控股有限公司 | The implementation method and device of phased mission system |
US20180285211A1 (en) * | 2017-04-04 | 2018-10-04 | International Business Machines Corporation | Recovering a failed clustered system using configuration data fragments |
CN110018926A (en) * | 2018-11-22 | 2019-07-16 | 阿里巴巴集团控股有限公司 | Fault recovery method, device, electronic equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |