Detailed Description
In the description of the present application, "/" indicates an "or" relationship between the related objects; for example, A/B may mean A or B. "And/or" in the present application merely describes an association relationship between related objects and indicates that three relationships may exist; for example, "A and/or B" covers three cases: A alone, both A and B, and B alone, where A and B may each be singular or plural, unless otherwise stated.
In the description of the present application, unless otherwise indicated, "a plurality" means two or more. "At least one of" the following items or the like means any combination of these items, including any combination of a single item or plural items. For example, "at least one of a, b, or c" may represent: a; b; c; a and b; a and c; b and c; or a, b, and c, where each of a, b, and c may be single or plural.
In addition, to facilitate a clear description of the technical solutions of the embodiments of the present application, the words "first", "second", and the like are used in the embodiments of the present application to distinguish between identical or similar items having substantially the same function and effect. Those skilled in the art will appreciate that the words "first", "second", and the like do not limit quantity or order of execution, and that objects qualified by "first" and "second" are not necessarily different.
In embodiments of the application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "such as" should not be construed as preferred or advantageous over other embodiments or designs. Rather, use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion that is readily understood.
It is appreciated that reference throughout this specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, appearances of "an embodiment" throughout the specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should also be understood that, in the various embodiments of the present application, the sequence numbers of the processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present application.
It can be appreciated that some optional features of the embodiments of the present application may, in some scenarios, be implemented independently of other features (for example, of the solution on which they are currently based) to solve corresponding technical problems and achieve corresponding effects, or may, in other scenarios, be combined with other features according to requirements. Accordingly, the device provided in the embodiments of the present application may likewise implement these features or functions, which will not be described herein.
In the present application, the same or similar parts of the embodiments may be referred to each other unless specifically stated otherwise. In the present application, unless specifically stated or logically conflicting, terms and/or descriptions in different embodiments are consistent and may be mutually referenced, and technical features in different embodiments may be combined to form new embodiments according to their inherent logical relationships. The following embodiments of the present application are not intended to limit the scope of the present application.
In order to facilitate understanding of the technical solution of the embodiments of the present application, a brief description of the related art of the present application is given below:
Rolling upgrade (rolling upgrade) refers to upgrading the components of a distributed IT system one by one, or batch by batch, in a certain order. During a rolling upgrade, new-version and old-version software components coexist in the system; the number of old-version components gradually decreases to zero, while the number of new-version components gradually increases until the whole system is covered.
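As an illustrative sketch (not part of the claimed embodiments; the helper name and batch model are assumptions), the batch-by-batch ordering that a rolling upgrade follows can be modeled as follows:

```python
def rolling_batches(components, batch_size=1):
    """Yield components in upgrade batches; during the rolling upgrade,
    old-version and new-version components coexist until the last batch
    has been upgraded."""
    for i in range(0, len(components), batch_size):
        yield components[i:i + batch_size]
```

For example, upgrading three worker nodes with a batch size of 2 yields the batches [worker1, worker2] and then [worker3], so at no point is the whole cluster offline.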
Load balancing (load balancing) means that, when a plurality of data packets need to be forwarded to a plurality of processes, successful forwarding is ensured and, overall, the packet load received by the processes in a normal state is balanced.
As shown in FIG. 1, the main architecture of an existing distributed storage system comprises a client side, a server side, and a master side. The client side may comprise a plurality of clients; the server side may take the form of a server cluster, such as a plurality of worker nodes; and the master side may take the form of a master cluster, such as an active master and a standby master. The master side may communicate with the view data persistence system.
A client near the user side can receive input/output (IO) data issued by the user, call a globally consistent route view, and send the IO data to a specific worker node on the server side.
The server cluster is composed of a large number of worker nodes, and each worker node can receive data from clients and persist the received data to a local persistence medium, for example, a mechanical disk or a solid state disk.
The master cluster is composed of a plurality of nodes with management functions, and high availability (high availability, HA) is achieved through an active-standby deployment mode. The active master maintains heartbeat and data-transmission connections with each worker node in the server cluster, and synchronously writes key cluster view data and IO view data into a third-party persistence system, such as the view data persistence system. The standby master monitors the health of the active master at all times, and when the active master fails, the standby master can quickly take over management of the server cluster.
For example, as shown in FIG. 2, an existing server storage cluster is upgraded in a load-sharing manner. The method comprises the following steps:
At time T1, the two worker nodes worker1 and worker2 are in the old-version state that needs to be upgraded, and the client calls route view v1 and sends IO data to the worker1.
At time T2, the client updates the route view to route view v2 and changes the destination node of the IO data from the worker1 to the worker2. While the client continues issuing IO data normally, the worker1 upgrades according to the normal flow: it first shuts down and then pulls up the new version.
At time T3, the upgrade of the worker1 is complete; the client updates the route view to route view v3, the destination node of the IO data is changed back to the worker1, and the worker2 performs the shutdown-upgrade-restart operation.
For example, as shown in FIG. 3, a master management cluster is generally upgraded in an active-standby switching mode, specifically as follows:
At time T1, the master cluster comprises an active master (master1) and a standby master (master2), and operation and maintenance personnel directly stop and upgrade the process of master2. Because the server storage cluster has elected master1 as the leader at this time and maintains its connection with master1, the shutdown upgrade of master2 does not affect the normal functions of the management cluster. In addition, master1 writes cluster-critical metadata information into the persistence system, such as a database or zookeeper, in real time.
At time T2, after the new version of master2 is upgraded and ready, master2 reads the key metadata information written by master1 from the metadata persistence system, and the operation and maintenance personnel then perform a shutdown upgrade on master1. After determining that master1 has failed, the server cluster elects master2 as the leader and establishes a connection with master2.
At time T3, master1 completes its upgrade and runs in standby-master mode. At this point, both the active and standby masters in the management cluster have completed the upgrade.
In distributed storage systems, existing upgrade schemes have a number of problems. First, server resources are unavailable during the shutdown restart: between stopping the old process and pulling up the new one, the server cannot provide external services, host resources cannot be used by the service, and resource utilization drops. Second, foreground IO performance is reduced: during a load-sharing upgrade there is a delay before the client updates its route view, and when the client uses a not-yet-updated view to send a request to the upgrading node, the request fails because the route view is wrong and the client keeps retrying; IO cannot proceed normally until the client has updated to the new route view. Load-sharing upgrade therefore degrades foreground IO performance.
In addition, frequent metadata updates place great stress on the metadata persistence system during an upgrade. The master cluster writes key cluster metadata into a third-party storage system, such as a database or zookeeper, and the frequent route-view updates during the upgrade require a large number of reads and writes to that third-party system, placing considerable pressure on it.
During a system upgrade, a highly available distributed storage system must consider several impacts: whether the upgrade affects foreground IO functionality, how to reduce the impact of the upgrade on foreground IO performance, and whether the upgrade allows easy and fast fault rollback. Based on these three points, existing distributed storage systems (such as the Pangu file system of Alibaba Cloud and the GFS of Google) adopt a stable storage-system upgrade scheme: a load-sharing mode for the server storage cluster and an active-standby switching mode for the master management cluster.
In the existing storage-system upgrade scheme, on the one hand, the load-sharing mode modifies the IO route view information of the storage cluster, the client fails and retries while the view data is being synchronized, and the performance of the storage system is reduced; on the other hand, the active-standby switching mode needs to communicate frequently with the view data persistence system, which increases the read-write pressure on that system. Moreover, the existing upgrade mode must first stop the old-version process and then pull up the new-version process, and between these two operations a single server provides no service at all, so server resources are wasted.
Therefore, an upgrade method is required that ensures a single server continuously provides external service, thereby improving resource utilization, while having little impact on foreground IO performance and placing little pressure on the view data persistence system.
Based on the above, an embodiment of the application provides a rolling upgrade method, which comprises: invoking a first process to execute a target task; writing first task state information into a shared memory, where the first task state information is state information related to the first process executing the target task; invoking a second process to read the first task state information from the shared memory; invoking the second process to execute the target task according to the first task state information; and stopping the first process once the first process has finished processing first task data of the target task, where the first task data is task data received by the first process. Because the second process executes the target task according to the first task state information of the first process, the new process and the old process coexist during the upgrade, which solves the problem that server resources are unavailable during the upgrade and improves resource utilization.
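The handoff just summarized can be sketched as a minimal simulation. All class and function names below are hypothetical, and a plain dict stands in for the shared memory; a real node would use an OS shared-memory segment:

```python
class Proc:
    """Toy stand-in for a storage/management process."""
    def __init__(self, version, received=None):
        self.version = version
        self.received = list(received or [])  # task data already delivered
        self.persisted = []
        self.state = None
        self.running = True

    def snapshot_state(self):
        # State at a key operation point of task execution.
        return {"stored": len(self.persisted)}

    def restore_state(self, state):
        self.state = dict(state)

    def drain(self):
        # Only already-received items; no new data arrives after cutover.
        while self.received:
            yield self.received.pop(0)

    def persist(self, item):
        self.persisted.append(item)

    def handle(self, item):
        self.persisted.append(item)
        return (self.version, item)

    def stop(self):
        self.running = False


def rolling_upgrade(shared_memory, old_process, new_process, pending_data):
    # Step 1: the first (old) process records its task state.
    shared_memory["task_state"] = old_process.snapshot_state()
    # Step 2: the second (new) process restores that state.
    new_process.restore_state(shared_memory["task_state"])
    # Step 3: the old process only drains data it already received.
    for item in old_process.drain():
        old_process.persist(item)
    # Step 4: the new process now serves the target task.
    results = [new_process.handle(d) for d in pending_data]
    # Step 5: the old process is stopped once its queue is empty.
    old_process.stop()
    return results
```

The key property the sketch illustrates is that the server is never without a process able to serve the target task: the new process is ready before the old one stops.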
The rolling upgrade method provided by the embodiments of the present application can be applied to the distributed storage system shown in FIG. 4, whose main architecture comprises a client side, a server side, and a master side. The client side may comprise a plurality of clients; the server side may take the form of a server cluster, such as a plurality of worker nodes; and the master side may take the form of a master cluster, such as an active master and a standby master. The master side may communicate with the view data persistence system.
A client near the user side can receive input/output (IO) data issued by the user, call a globally consistent route view, and send the IO data to a specific worker node on the server side.
The server cluster is composed of a large number of worker nodes, and each worker node can receive data from clients and persist the received data to a local persistence medium, for example, a mechanical disk or a solid state disk.
The master cluster is composed of a plurality of nodes with management functions, and high availability (high availability, HA) is achieved through an active-standby deployment mode. The active master maintains heartbeat and data-transmission connections with each worker node in the server cluster, and synchronously writes key cluster view data and IO view data into a third-party persistence system, such as the view data persistence system. The standby master monitors the health of the active master at all times, and when the active master fails, the standby master can quickly take over management of the server cluster.
The rolling upgrade method provided by the embodiments of the present application is applicable to the rolling upgrade of the worker nodes, serving as storage nodes, in the distributed storage system shown in FIG. 4, and is also applicable to the rolling upgrade of the master nodes, serving as management nodes, in the same system.
In the embodiments of the application, the rolling upgrade method can be applied not only to a distributed storage system but also to systems providing other services, such as a distributed computing system providing computing services, a distributed network system providing network services, a database system, and the like. The embodiments of the present application mainly address the process change of a node; however, in any scenario requiring continuous service, the rolling upgrade method can also be applied to other situations. For example, after the first process fails, a second process of the same version as the first process may replace it, and the rolling upgrade method provided by the embodiments of the present application may be used to keep providing external service during the replacement.
It is to be understood that, in the embodiments of the present application, the execution subject may perform some or all of the steps in the embodiments; these steps or operations are only examples, and the embodiments of the present application may also perform other operations or variations of the various operations. Furthermore, the various steps may be performed in a different order than presented in the embodiments of the application, and it is possible that not all of the operations in the embodiments need to be performed.
It should be noted that, in the following embodiments of the present application, names of messages between devices or names of parameters in a message are merely examples, and may be other names in specific implementations, which are not limited in particular by the embodiments of the present application.
In the embodiments of the application, the rolling upgrade method can be applied to various scenarios; the upgrade of the storage nodes and the management nodes in a distributed storage system is illustrated below:
As shown in FIG. 5, a rolling upgrade method provided in an embodiment of the present application includes the following steps:
In an offline stage, the rolling upgrade method provided by the embodiment of the application can execute the following steps:
501. An update to the first process is triggered.
The network device is triggered to upgrade the first process, i.e., the first process is scheduled to be updated to the second process, where the first process is the old-version process before the upgrade and the second process is the new-version process after the upgrade.
In the embodiment of the present application, the update and upgrade of the first process may be triggered by an operation and maintenance manager or a network device system, which is not limited herein.
In the embodiments of the application, within a distributed storage system, the network device may be a storage node for executing a storage task, in which case the first process and the second process are both storage processes. The network device may also be a management node for performing management tasks, in which case the first process and the second process are both management processes; it may also be another type of device implementing other corresponding functions, which is not limited herein.
502. Write the first task state information into the shared memory.
The network device invokes the first process to execute the target task, and after triggering the update of the first process, the network device can write first task state information into the shared memory, where the first task state information is related state information of the first process to execute the target task.
In the embodiment of the application, after the update of the first process is triggered, the first process continues to execute the target task but writes the state information at key operation points of task execution into the shared memory; the first task state information indicates the execution state of the first process with respect to the target task. For example, when the process is a storage process, the first task state information may indicate the execution state of a target storage task: say the target storage task needs to store 10 data packets in a storage medium, the network device has received 6 of them, the first process has already stored 5 in the storage medium, and one packet is still being processed. At this point, the network device may invoke the first process to store this state information in the shared memory.
In the embodiment of the present application, the shared memory is a shared memory that can be accessed and read by both the first process and the second process, and the shared memory may be a memory in the network device, which is not limited herein.
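A minimal sketch of how the first process might write its task state into a named shared-memory segment, and how the second process might later read it back, using Python's `multiprocessing.shared_memory`. The segment name, fixed size, and length-prefixed JSON layout are illustrative assumptions, not part of the embodiments:

```python
import json
from multiprocessing import shared_memory

STATE_SIZE = 4096  # assumed fixed size of the state segment


def write_task_state(name, state):
    """First process: serialize the task state into a named
    shared-memory segment readable by the second process."""
    shm = shared_memory.SharedMemory(name=name, create=True, size=STATE_SIZE)
    payload = json.dumps(state).encode()
    # 4-byte little-endian length prefix, then the JSON payload.
    shm.buf[:4] = len(payload).to_bytes(4, "little")
    shm.buf[4:4 + len(payload)] = payload
    shm.close()
    return name


def read_task_state(name):
    """Second process: attach to the segment, restore the state,
    and release the segment."""
    shm = shared_memory.SharedMemory(name=name)
    n = int.from_bytes(bytes(shm.buf[:4]), "little")
    state = json.loads(bytes(shm.buf[4:4 + n]).decode())
    shm.close()
    shm.unlink()  # segment no longer needed once the state is restored
    return state
```

In this layout the writer and reader only need to agree on the segment name and the length-prefix convention, which mirrors how the first and second process share the state through memory rather than through a third-party persistence system.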
In a possible implementation manner, the network device is a storage node configured to perform a storage task, and the first process and the second process are both storage processes. The first process may write the first task state information into the shared memory.
In a possible implementation manner, the network device is a management node configured to perform a management task, and the first process and the second process are both management processes. The first management process may write the first task state information into the shared memory; in addition, the first management process may write metadata information into the shared memory, where the metadata information is the metadata of the entire storage cluster.
In the embodiment of the application, for the management node, the first management process can write both the first task state information and the metadata information into the shared memory, so that the second management process can acquire the metadata information from the shared memory without accessing the third-party metadata persistence system, reducing the pressure that the upgrade places on the third-party persistence system.
In the embodiment of the application, the third-party persistence system may be a database or a zookeeper persistence system, and the storage medium may be a storage medium such as a magnetic disk or a solid state disk, which is not limited in particular.
503. A second process is created.
The network device creates (pulls up) a second process whose task interface is the same as the task interface of the first process.
Specifically, after triggering the update to the first process, the network device may create a new version of the second process, where the task interface of the second process is the same as the task interface of the first process, i.e. the second process may receive the same task data as the first process.
For example, in the case where the network device is a storage node for performing a storage task and the first process and the second process are both storage processes, the network device may distribute received task data to the respective processes through the operating system (operating system, OS) kernel. When the second process is created, the OS port monitored by the second process is the same as the OS port monitored by the first process; when the socket connection is established, the SO_REUSEPORT feature is used so that multiple processes can bind the same port.
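As an illustration of the port-sharing step (a sketch assuming a Linux system where `SO_REUSEPORT` is available, i.e. kernel 3.9 or later), two processes can each create a listener on the same port:

```python
import socket


def bind_reuseport(port):
    """Create a listening socket that can share its port with another
    process, e.g. the old and new storage process during the upgrade."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEPORT lets several sockets bind the identical address:port,
    # provided every socket sets the option before binding.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s
```

With `SO_REUSEPORT` set on both listeners, the kernel spreads incoming connections across them, which is what allows the first and second process to coexist on the same OS port while the upgrade is in progress.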
504. Invoke the second process to read the first task state information from the shared memory.
The network equipment calls a second process to read the first task state information from the shared memory, restores the process state and prepares to execute the target task according to the first task state information. After the second process resumes the process state based on the first task state information, the network device may consider the second process ready to execute the target task.
505. Stop invoking the first process to receive task data of the target task.
After the second process of the network device resumes the process state according to the first task state information, i.e. after the second process is ready to execute the target task, the network device stops invoking the first process to receive the task data of the target task, i.e. the operating system kernel of the network device no longer distributes the task data to the first process.
However, for task data that the first process has received, the first process will continue to process the task data until all processing is completed. For example, for a storage task, for task data that has been received by the first storage process, the first storage process may continue to write the task data to the storage medium until all writing is complete.
506. Invoke the second process to execute the target task.
The network device invokes the second process to perform the target task.
Specifically, after the second process of the network device is ready to execute the target task, the network device stops invoking the first process to receive task data of the target task and starts invoking the second process to execute it. The second process begins processing the task data corresponding to the target task from the process state restored from the first task state information.
In a possible implementation manner, the network device is a storage node configured to perform a storage task, and the first process and the second process are both storage processes. After the second storage process is ready, the operating system kernel of the network device distributes the received storage task data to the second storage process, and the second storage process writes the storage task data into the storage medium. The storage task data received by the storage node is IO data issued by users: the data packets are large and are sent and received frequently.
In a possible implementation manner, the network device is a management node configured to perform a management task, and the first process and the second process are both management processes. After the second management process is ready, the operating system kernel of the network device distributes the received management task data to the second management process, and the second management process writes the management task data into the third-party persistence system, such as a database or zookeeper. The management task data received by the management node is of a command-and-control type: the data packets are small and are sent and received infrequently.
In the embodiment of the application, the second process executes the target task according to the first task state information of the first process, and the new process and the old process coexist during upgrading, so that the problem that the server resource is unavailable during upgrading is solved, and the resource utilization rate is improved.
In the embodiment of the application, the operating system kernel of the network device can distribute the task data received by the network device to a plurality of processes (the first process and the second process) through a load balancing mechanism, thereby avoiding the impact on foreground IO performance during the upgrade of the storage node, reducing the risk of service interruption, and ensuring the stability and continuity of the service.
507. The first process is stopped.
The first process is stopped once it has finished processing the first task data of the target task, where the first task data is the task data already received by the first process.
In the embodiment of the present application, after stopping distributing the task data to the first process, the first process further needs to continue to process the received first task data, as described in step 505. After the first process has processed the first task data, the network device may stop (kill) the first process.
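The drain-then-stop sequence above can be sketched with a queue-based model (the function and queue names are illustrative; once the OS kernel stops distributing task data to the first process, no new items are enqueued):

```python
import queue


def drain_then_stop(inbox, persist):
    """Finish every item the first process already received, then report
    how many items were drained; an empty inbox means the process can
    safely be stopped (killed)."""
    drained = 0
    while True:
        try:
            item = inbox.get_nowait()
        except queue.Empty:
            # No pending first task data left: safe to stop the process.
            return drained
        persist(item)  # e.g. write to the storage medium
        drained += 1
```

This mirrors the condition the network device checks: only when the first process no longer writes data (its inbox is empty) is the process stopped.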
In a possible implementation manner, the network device is a storage node configured to perform a storage task, and the first process and the second process are both storage processes. The network device invokes the first storage process to continue processing the received first task data, which was received before the first storage process stopped receiving task data; after the first storage process finishes processing the first task data, the network device may stop the first storage process. For example, after the first storage process has stored all of the received first task data in the storage medium, the network device detects that the first storage process no longer writes data into the storage medium, determines that the first storage process has completed all IO requests, and may then stop the first storage process.
In a possible implementation manner, the network device is a management node configured to perform a management task, and the first process and the second process are both management processes. The network device invokes the first management process to continue processing the received first task data, which was received before the first management process stopped receiving task data; after the first management process finishes processing the first task data, the network device may stop the first management process. For example, after the first management process has written all of the received first task data into the third-party persistence system, the network device detects that the first management process no longer writes data into that system, determines that the first management process has processed all of the first task data, and may then stop the first management process.
In the embodiments of the present application, the network device 800 may be divided into functional modules according to the above method example; for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in hardware or as a software functional module. It should be noted that the division of the modules in the embodiments of the present application is schematic and is merely a division by logical function; other division manners may be used in actual implementation.
FIG. 8 shows a possible structural diagram of the network device 800 involved in the above-described embodiments, in the case where each functional module is divided corresponding to each function. As shown in FIG. 8, the network device 800 includes:
an updating module 801, configured to trigger updating the first process to the second process. For example, step 501, triggers an update to the first process.
The first execution module 802 is configured to invoke the first process to execute the target task, for example, step 502 is to write the status information of the first task into the shared memory.
The writing module 803 is configured to write first task state information into the shared memory, where the first task state information is related state information of a task that the first process executes, for example, step 502 is to write the first task state information into the shared memory.
In a possible implementation manner, the target task is a management task, and the writing module 803 is specifically configured to write the first task state information and the metadata information into the shared memory. For example, step 502, the first task state information is written into the shared memory.
The creating module 804 is configured to create a second process, where a task interface of the second process is the same as a task interface of the first process. For example, step 503, creates a second process.
And a reading module 805, configured to invoke the second process to read the first task state information from the shared memory, for example, in step 504, invoke the second process to read the first task state information from the shared memory.
And a distributing module 806, configured to distribute, by the operating system kernel, the task data of the target task to the second process after the operating system kernel stops distributing the task data of the target task to the first process. For example, step 506, invokes the second process to perform the target task.
A second stopping module 807 for stopping, by the operating system kernel, the distribution of the task data of the target task to the first process. For example, step 505, stops invoking the first process to receive task data for the target task.
And a processing module 808, configured to invoke the first process to process the first task data of the target task. For example, step 506, invokes the second process to perform the target task.
And a second execution module 809, configured to invoke the second process to execute the target task according to the first task state information, for example, step 506, and invoke the second process to execute the target task.
In one possible implementation, the second execution module 809 includes:
a restoring unit 810, configured to invoke the second process to restore the task process state according to the first task state information; for example, in step 506, the second process is invoked to execute the target task; and
an execution unit 811, configured to invoke the second process to execute the target task according to the task process state; for example, in step 506, the second process is invoked to perform the target task.
A first stopping module 812 is configured to stop the first process when the first process stops processing the first task data of the target task, where the first task data is the task data received by the first process; for example, in step 507, the first process is stopped.
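The state handoff performed by the writing module 803 and the reading module 805 can be sketched in code. The following is a minimal illustration, not the embodiment itself: it uses Python's `multiprocessing.shared_memory` as a stand-in for the shared memory described above, and the helper names, the JSON serialization, and the 4-byte length-prefix layout are all illustrative assumptions.

```python
import json
from multiprocessing import shared_memory


def write_task_state(shm_name, state):
    """First process (cf. writing module 803): serialize task state into shared memory.

    Layout (an assumption for this sketch): a 4-byte little-endian length
    prefix followed by a JSON payload.
    """
    payload = json.dumps(state).encode("utf-8")
    shm = shared_memory.SharedMemory(name=shm_name, create=True,
                                     size=4 + len(payload))
    shm.buf[:4] = len(payload).to_bytes(4, "little")
    shm.buf[4:4 + len(payload)] = payload
    return shm  # keep a handle so the segment can be unlinked later


def read_task_state(shm_name):
    """Second process (cf. reading module 805): attach to the segment and read the state."""
    shm = shared_memory.SharedMemory(name=shm_name)
    n = int.from_bytes(shm.buf[:4], "little")
    state = json.loads(bytes(shm.buf[4:4 + n]).decode("utf-8"))
    shm.close()
    return state
```

In this sketch the first process would call `write_task_state(...)` before the second process is created, and the newly created second process would call `read_task_state(...)` with the same segment name, as in step 504.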
The modules of the apparatus described above may also be used to perform other actions in the foregoing method embodiments, and for all relevant content of each step of the foregoing method embodiments, reference may be made to the functional descriptions of the corresponding functional modules, which are not repeated here.
Fig. 9 is a schematic diagram of a network device according to an embodiment of the present application. The network device 900 may include one or more central processing units (central processing unit, CPU) 901 and a memory 905, where one or more application programs or data are stored in the memory 905.
The memory 905 may be volatile storage or persistent storage. The program stored in the memory 905 may include one or more modules, and each module may include a series of instruction operations on the network device. Further, the central processor 901 may be arranged to communicate with the memory 905 and execute, on the network device 900, the series of instruction operations in the memory 905.
The central processor 901 is configured to execute a computer program in the memory 905, so that the network device 900 is configured to: invoke a first process to execute a target task; write first task state information into the shared memory, where the first task state information is state information related to the execution of the target task by the first process; invoke a second process to read the first task state information from the shared memory; invoke the second process to execute the target task according to the first task state information; and stop the first process when the first process stops processing first task data of the target task, where the first task data is the task data received by the first process. For a specific implementation, refer to steps 501-507 in the embodiment shown in Fig. 5, which are not repeated here.
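The sequence of steps 501-507 carried out by the network device can be sketched end to end. The following is a minimal illustration only, using Python's `multiprocessing`: queues stand in for both the kernel's task-data distribution and the shared memory, and every function name is an illustrative assumption rather than part of the embodiment.

```python
import json
import multiprocessing as mp

# "fork" keeps child processes in the same module state as the parent.
ctx = mp.get_context("fork")


def first_process(task_q, state_q):
    # Steps 501-502: execute the target task, then publish the task state.
    processed = []
    while (item := task_q.get()) is not None:  # None marks step 505: the
        processed.append(item)                 # kernel stops feeding this process
    state_q.put(json.dumps({"processed": processed}))  # stand-in for shared memory


def second_process(task_q, state_q, result_q):
    # Steps 504 and 506: read the first process's state and resume the task.
    state = json.loads(state_q.get())
    processed = state["processed"]
    while (item := task_q.get()) is not None:
        processed.append(item)
    result_q.put(processed)


def rolling_upgrade_demo():
    task_q, state_q, result_q = ctx.Queue(), ctx.Queue(), ctx.Queue()
    p1 = ctx.Process(target=first_process, args=(task_q, state_q))
    p1.start()                                  # step 501
    for item in (1, 2):                         # task data routed to the first process
        task_q.put(item)
    task_q.put(None)                            # step 505: stop distribution to it
    p2 = ctx.Process(target=second_process,     # step 503: create the second process
                     args=(task_q, state_q, result_q))
    p2.start()
    for item in (3, 4):                         # step 506: redirected task data
        task_q.put(item)
    task_q.put(None)
    result = result_q.get()
    p1.join()                                   # step 507: the first process exits
    p2.join()
    return result
```

In this sketch the second process blocks on the state queue until the first process has drained its share of the task data, so `rolling_upgrade_demo()` returns `[1, 2, 3, 4]`: the second process finishes the task exactly where the first process left off.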
The network device 900 may also include one or more power supplies 902, one or more wired or wireless network interfaces 903, one or more input/output interfaces 904, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The network device 900 may perform the operations performed by the network device in the embodiment shown in Fig. 5, which are not repeated here.
Embodiments of the present application also provide a computer program product comprising instructions. The computer program product may be a software product or a program product containing instructions that is capable of running on the network device or stored in any available medium. When the computer program product runs on the network device, the network device is caused to perform the rolling upgrade method of the embodiment shown in Fig. 5.
Embodiments of the present application also provide a computer-readable storage medium. The computer-readable storage medium may be any available medium that a server can store, or a data storage device such as a data center containing one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, or magnetic tape), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives), among others. The computer-readable storage medium includes instructions that instruct the network device to perform the rolling upgrade method of the embodiment shown in Fig. 5.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using a software program, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions of the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from a website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (e.g., infrared, radio, or microwave) means. The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, or magnetic tape), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives (SSD)), among others. In an embodiment of the application, the computer may comprise the apparatus described above.
Although the application is described herein in connection with various embodiments, other variations of the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.