CN106502835B

CN106502835B - Disaster recovery backup method and device

Info

Publication number: CN106502835B
Application number: CN201610943435.9A
Authority: CN
Inventors: 韩笑; 郝建明; 宋泽锋; 伍福生; 简超; 潘星明; 李兴锋
Original assignee: China Unionpay Co Ltd
Current assignee: China Unionpay Co Ltd
Priority date: 2016-10-26
Filing date: 2016-10-26
Publication date: 2018-10-16
Anticipated expiration: 2036-10-26
Also published as: CN106502835A

Abstract

The present invention provides a disaster recovery backup method and device. The method includes: determining whether a backup node is in a data synchronization interrupted state; when the backup node is in the data synchronization interrupted state, determining whether the time difference between the time at which data was last updated on the master node and the system time is less than a first time threshold; if that time difference is less than the first time threshold, starting the backup node to synchronize data from the master node; and polling the backup node and, when the backup node completes data synchronization, stopping the backup node from synchronizing data. With the disaster recovery backup method and device provided by the embodiments of the present invention, each controlled node can be controlled remotely, so that remote data replication and backup become a defined, automated process.

Description

Disaster recovery backup method and device

Technical field

The present invention relates to the field of communication technology, and in particular to a disaster recovery backup method and device.

Background

With the rapid development of information technology, information security has become a widely shared concern across industries. Computers now provide faster, more real-time information services than ever before and have automated information storage and management, but their widespread use also brings substantial risk: once a computer system suffers irreversible damage, the losses can be enormous. To restore data quickly after a disaster, the prior art generally relies on data backup, either manual or automatic. NetApp SnapMirror software offers proven efficiency, simplicity, and reasonable cost, and has for many years been the preferred technology for replication and disaster recovery in NetApp storage environments. However, although NetApp SnapMirror can replicate and back up data to a remote site efficiently, the remote transfer of data has to be controlled manually.

Summary of the invention

To solve the above technical problem, the present invention provides a disaster recovery backup method and device.

In one aspect, the present invention provides a disaster recovery backup method, comprising:

determining whether a backup node is in a data synchronization interrupted state;

when the backup node is in the data synchronization interrupted state, determining whether the time difference between the time at which data was last updated on the master node and the system time is less than a first time threshold;

if the time difference between the time at which data was last updated on the master node and the system time is less than the first time threshold, starting the backup node to synchronize data from the master node; and

polling the backup node and, when the backup node completes data synchronization, stopping the backup node from synchronizing data.

In an embodiment, when the backup node completes data synchronization but cannot be stopped, the method further comprises: polling the master node for a preset time and, if the master node stops receiving external data within the preset time, stopping the backup node from synchronizing data.

In an embodiment, the method further comprises: if the master node does not stop receiving external data within the preset time, issuing an alarm and reporting a disaster recovery error.

In an embodiment, when the backup node completes data synchronization, the method further comprises:

determining whether the time difference between the time at which the backup node last synchronized data and the time at which data was last updated on the master node is less than a second time threshold; and

when that time difference is less than the second time threshold, stopping the backup node from synchronizing data.

In an embodiment, after the backup node has been started to synchronize data from the master node because the time difference between the master node's last update and the system time is less than the first time threshold, the method further comprises: determining whether the backup node is still in the data synchronization interrupted state and, if so, restarting the backup node to perform data synchronization.

In an embodiment, the method further comprises: starting a timer once the backup node has been started, and checking whether the total time the backup node has spent on the current data synchronization exceeds a third time threshold; if so, issuing an alarm and reporting a disaster recovery error.

In an embodiment, the method further comprises: when the backup node is determined not to be in the data synchronization interrupted state, issuing an alarm and reporting a disaster recovery error.

In an embodiment, the method further comprises: when the time difference between the time at which the backup node last synchronized data and the time at which data was last updated on the master node is greater than the second time threshold, having the backup node continue to synchronize data.

In another aspect, an embodiment of the present invention provides a disaster recovery backup device, comprising:

a backup node access unit, configured to determine whether the backup node is in the data synchronization interrupted state;

a master node access unit, configured to determine whether the time difference between the time at which data was last updated on the master node and the system time is less than the first time threshold;

a backup node startup unit, configured to start the backup node to synchronize data from the master node when the time difference between the master node's last update and the system time is less than the first time threshold; and

a polling control unit, configured to poll the backup node and, when the backup node completes data synchronization, stop the backup node from synchronizing data.

In an embodiment, when the backup node completes data synchronization but cannot be stopped, the polling control unit is further configured to: poll the master node for a preset time and, if the master node stops receiving external data within the preset time, stop the backup node from synchronizing data.

In an embodiment, the disaster recovery backup device further comprises an error reporting unit, configured to issue an alarm and report a disaster recovery error when the master node does not stop receiving external data within the preset time.

In an embodiment, the polling control unit further comprises:

a time difference determining module, configured to determine whether the time difference between the time at which the backup node last synchronized data and the time at which data was last updated on the master node is less than the second time threshold;

a node interruption module, configured to stop the backup node from synchronizing data when that time difference is less than the second time threshold;

a determining module, configured to determine, after the backup node startup unit has started the backup node, whether the backup node is still in the data synchronization interrupted state; and

a restart module, configured to restart the backup node to perform data synchronization when the backup node is still in the data synchronization interrupted state.

In an embodiment, the disaster recovery backup device further comprises a timing unit, configured to start a timer once the backup node has been started and to check whether the total time the backup node has spent on the current data synchronization exceeds the third time threshold; when it does, the error reporting unit issues an alarm and reports a disaster recovery error.

In an embodiment, the error reporting unit is further configured to issue an alarm and report a disaster recovery error when the output of the backup node access unit is negative.

In an embodiment, the polling control unit further comprises an update module, configured to have the backup node continue to synchronize data when the output of the time difference determining module is negative.

In an embodiment, the disaster recovery backup device is located in the backup node and accesses the master node remotely over SSH, where SSH denotes the Secure Shell protocol.

In an embodiment, the disaster recovery backup device is located in the master node and accesses the backup node remotely over SSH.

In an embodiment, the disaster recovery backup device is deployed independently, outside both the backup node and the master node.

With the disaster recovery backup method and device provided by the embodiments of the present invention, each controlled node can be controlled remotely, so that remote data replication and backup become a defined, automated process.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a disaster recovery backup method according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of another disaster recovery backup method according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a disaster recovery backup device according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of the polling control unit 4 according to an embodiment of the present invention;

FIG. 5 is an architecture diagram of a disaster recovery backup device according to an embodiment of the present invention;

FIG. 6 is a control structure diagram of another disaster recovery system according to an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments that a person of ordinary skill in the art obtains from these embodiments without creative effort fall within the scope of protection of the present invention.

The present invention provides an embodiment of a disaster recovery backup method. As shown in FIG. 1, the method mainly comprises the following steps:

Step S1: determine whether the backup node is in the data synchronization interrupted state.

If the backup node is currently in the data synchronization interrupted state, it meets the requirement for resuming data synchronization. If the result of step S1 is negative, the backup node does not meet that requirement; an alarm is issued and a disaster recovery error is reported (step S6), and the following steps are not executed.

Step S2: when the backup node is in the data synchronization interrupted state, further determine whether the time difference between the time at which data was last updated on the master node and the system time is less than the first time threshold.

In this embodiment the master node is the one that receives external data, and the backup node copies data from the master node, so the master node is the backup node's data source. Step S2 accesses the master node and determines whether the time difference between the master node's last update and the system time is less than a preset time threshold. If it is, the master node's last update is essentially in step with the external data source and the subsequent steps can be executed. The main purpose of this step is therefore to confirm that the external data held by the master node is the latest data.

In an embodiment, this time threshold is typically 15 minutes: the time difference between the master node's last update and the current system time is allowed to be up to 15 minutes, and if it exceeds 15 minutes an alarm is issued and a disaster recovery error is reported (step S6).
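The freshness check of step S2 can be sketched as a small predicate. This is a minimal illustration, not the patented implementation: the function name `master_is_fresh` and its parameters are hypothetical, and only the 15-minute value comes from the embodiment.

```python
from datetime import datetime, timedelta

# 15 minutes is the typical first time threshold named in the embodiment.
FIRST_THRESHOLD = timedelta(minutes=15)


def master_is_fresh(last_update: datetime, now: datetime,
                    threshold: timedelta = FIRST_THRESHOLD) -> bool:
    """Return True when the master's last update lies within `threshold`
    of the system time (hypothetical helper for step S2)."""
    return (now - last_update) < threshold
```

A caller would raise the step S6 alarm when this returns False.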

Step S3: if the time difference between the time at which data was last updated on the master node and the system time is less than the first time threshold, start the backup node to synchronize data from the master node.

If the master node's last update is no more than 15 minutes behind the current system time, the external data most recently updated on the master node can be treated as the latest data, and the backup node is started to synchronize data from the master node.

Step S4: poll the backup node and determine whether it has completed data synchronization.

Step S5: when the backup node has completed data synchronization, stop the backup node from synchronizing data.

The embodiment of the present invention remotely controls the backup node over SSH so that it fetches the external data held on the master node, which ensures that the data is synchronized to the backup node. When implementing this embodiment, only one backup node has to be managed, so the managed object is simple, and remote data replication and backup become a defined, automated process.
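Steps S1 to S6 can be read as a short control loop. The sketch below is one possible arrangement under stated assumptions: every callable passed in is a hypothetical stand-in for a remote operation on a node, and `max_polls` is an illustrative bound that the text does not specify.

```python
from typing import Callable


class DisasterRecoveryError(Exception):
    """Stands in for the alarm and error report of step S6."""


def run_backup(is_broken_off: Callable[[], bool],
               master_is_fresh: Callable[[], bool],
               start_sync: Callable[[], None],
               sync_done: Callable[[], bool],
               stop_sync: Callable[[], None],
               wait: Callable[[], None],
               max_polls: int = 80) -> None:
    # S1: the backup node must be in the interrupted state.
    if not is_broken_off():
        raise DisasterRecoveryError("backup node is not in the interrupted state")
    # S2: the master's data must be recent enough.
    if not master_is_fresh():
        raise DisasterRecoveryError("master data is older than the first threshold")
    # S3: start synchronization from the master.
    start_sync()
    # S4: poll the backup node until it reports completion.
    for _ in range(max_polls):
        if sync_done():
            stop_sync()  # S5: stop synchronization once complete.
            return
        wait()
    raise DisasterRecoveryError("synchronization did not complete (assumed bound)")
```

Passing the node operations in as callables keeps the flow testable without a real storage system.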

In an embodiment, when the backup node completes data synchronization, step S5 normally stops the backup node so that its state returns to the data synchronization interrupted state. However, if the master node is receiving external data at that moment, that is, the master node is in a data transfer state, it may not be possible to stop the backup node even though it has completed data synchronization. To avoid repeated error reports, a buffer time is preset for the master node, and the master node is polled for the duration of this buffer time. Only if the master node becomes idle (stops receiving external data) within the buffer time can the backup node be stopped successfully. If the master node does not stop receiving external data within the preset time, an alarm is issued and a disaster recovery error is reported.
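The buffer-time polling described above can be sketched as follows. The names and the injected `sleep` and `clock` parameters are assumptions made so the logic can be exercised without real waiting; the text gives no concrete buffer length or polling interval.

```python
import time
from typing import Callable


def wait_for_master_idle(master_is_receiving: Callable[[], bool],
                         grace_seconds: float,
                         interval_seconds: float,
                         sleep: Callable[[float], None] = time.sleep,
                         clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll the master until it stops receiving external data.

    Returns True if the master became idle within the buffer time,
    False if the buffer time ran out (the caller would then raise the alarm).
    """
    deadline = clock() + grace_seconds
    while True:
        if not master_is_receiving():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)
```

On a True result the caller retries stopping the backup node; on False it reports the disaster recovery error.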

FIG. 2 is also a schematic flowchart of a disaster recovery backup method provided by an embodiment of the present invention. As shown in FIG. 2, between polling the backup node and stopping the backup node from synchronizing data there is an additional determining step S7, which determines whether the backup node is still in the data synchronization interrupted state. If it is, step S3 failed to start the backup node, and the backup node must be restarted to perform data synchronization.
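Step S7 amounts to a check-and-restart loop. In the sketch below the callables are hypothetical stand-ins for operations on the backup node, and the `max_restarts` bound is an assumption added so the loop always terminates; the text does not say how many restarts are attempted.

```python
from typing import Callable


def ensure_sync_started(is_broken_off: Callable[[], bool],
                        start_sync: Callable[[], None],
                        max_restarts: int = 3) -> bool:
    """Restart synchronization while the backup node still reports the
    interrupted state. Returns True once the node has left that state."""
    for _ in range(max_restarts):
        if not is_broken_off():
            return True
        start_sync()  # S3 did not take effect, so start again.
    return not is_broken_off()
```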

As shown in FIG. 2, when the backup node completes data synchronization, the method further includes a step S8: determine whether the time difference between the time at which the backup node last synchronized data and the time at which data was last updated on the master node is less than a time threshold, and stop the backup node from synchronizing data when it is. This step verifies the result after the backup node has finished copying data, ensuring that the data the backup node last received is the latest external data on the master node. When the time difference is greater than the time threshold, the backup node continues to synchronize data. The time threshold here may be 5 minutes or 15 minutes, but is generally not less than 5 minutes.
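The decision in step S8 can be sketched as a function that returns the next action. Two points are assumptions rather than statements from the text: the absolute value of the difference is used, and a difference exactly equal to the threshold is treated as "continue", since the text only covers the strictly smaller and strictly greater cases.

```python
from datetime import datetime, timedelta


def next_action(backup_last_sync: datetime,
                master_last_update: datetime,
                threshold: timedelta = timedelta(minutes=5)) -> str:
    """Return "stop" when the backup node has caught up with the master,
    otherwise "continue" (hypothetical helper for step S8)."""
    lag = abs(master_last_update - backup_last_sync)
    return "stop" if lag < threshold else "continue"
```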

The disaster recovery backup method provided by the embodiment of the present invention also includes a timing step (not shown in FIG. 2). To guard against the backup node hanging while it synchronizes data from the master node, a further time threshold is set: a timer is started at the same moment the backup node is started, and the method checks whether the total time the backup node has spent on the current data synchronization exceeds this threshold; if so, a disaster recovery error is reported. The threshold is typically 40 minutes, that is, the whole process of the backup node synchronizing data from the master node should finish within 40 minutes. Exceeding the threshold indicates that something abnormal has happened during synchronization and that some step may have hung. This check is the fallback that keeps the overall synchronization process under control.
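The timing step can be sketched as a small watchdog object. The class name and the injectable `clock` are illustrative assumptions; only the 40-minute limit comes from the embodiment.

```python
import time
from typing import Callable

# 40 minutes is the typical third time threshold named in the embodiment.
THIRD_THRESHOLD_SECONDS = 40 * 60


class SyncWatchdog:
    """Tracks how long the current synchronization has been running."""

    def __init__(self, limit_seconds: float = THIRD_THRESHOLD_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._limit = limit_seconds
        self._started = clock()  # timing starts when the backup node is started

    def elapsed(self) -> float:
        return self._clock() - self._started

    def expired(self) -> bool:
        # True once the overall synchronization time exceeds the limit.
        return self.elapsed() > self._limit
```

A polling loop would call `expired()` on each iteration and report the error as soon as it returns True.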

Based on the same inventive concept as the disaster recovery backup methods shown in FIG. 1 and FIG. 2, an embodiment of the invention also provides a disaster recovery backup device, as described in the following embodiments. Because the device solves the problem on the same principle as the method, its implementation can follow that of the method, and the overlapping details are not repeated.

FIG. 3 is a schematic structural diagram of a disaster recovery backup device according to an embodiment of the present invention. As shown in FIG. 3, the device includes a backup node access unit 1, a master node access unit 2, a backup node startup unit 3, and a polling control unit 4. The backup node access unit 1 determines whether the backup node is in the data synchronization interrupted state. The master node access unit 2 determines whether the time difference between the time at which data was last updated on the master node and the system time is less than the first time threshold. The backup node startup unit 3 starts the backup node to synchronize data from the master node when that time difference is less than the first time threshold. The polling control unit 4 polls the backup node and, when the backup node completes data synchronization, stops the backup node from synchronizing data.

In an embodiment, when the backup node completes data synchronization but cannot be stopped, the polling control unit 4 can also poll the master node for a preset time; if the master node stops receiving external data within the preset time, the backup node can be stopped from synchronizing data.

In an embodiment, the disaster recovery backup device further includes an error reporting unit 5, which issues an alarm and reports a disaster recovery error when the master node does not stop receiving external data within the preset time.

As shown in FIG. 4, the polling control unit 4 includes a time difference determining module 41 and a node interruption module 42. The time difference determining module 41 determines whether the time difference between the time at which the backup node last synchronized data and the time at which data was last updated on the master node is less than the second time threshold. The node interruption module 42 stops the backup node from synchronizing data when that time difference is less than the second time threshold. The second time threshold may be set to no less than 5 minutes.

In an embodiment, the polling control unit 4 further includes a determining module 43 and a restart module 44. The determining module 43 determines, after the backup node startup unit 3 has started the backup node, whether the backup node is still in the data synchronization interrupted state. The restart module 44 restarts the backup node to perform data synchronization when the backup node is still in that state. If, after the backup node startup unit 3 has started the backup node, the determining module 43 finds that the backup node is still in the data synchronization interrupted state, the backup node did not start successfully and the restart module 44 must restart it to perform data synchronization.

In an embodiment, the disaster recovery backup device further includes a timing unit 6. The timing unit 6 starts a timer once the backup node has been started and checks whether the total time the backup node has spent on the current data synchronization exceeds the third time threshold; when it does, the error reporting unit 5 issues an alarm and reports a disaster recovery error.

In an embodiment, when the output of the backup node access unit 1 is negative (that is, the backup node's initial state is not the data synchronization interrupted state), the error reporting unit 5 issues an alarm and reports a disaster recovery error.

In an embodiment, the polling control unit 4 further includes an update module 45, which, when the output of the time difference determining module 41 is negative, jumps back to the backup node startup unit 3 so that the backup node is started and continues to synchronize data from the master node.

The disaster recovery backup device can be located in the backup node, in the master node, or independently outside both. When the device is located in the backup node, it can access the master node remotely over SSH (Secure Shell). When the device is located in the master node, it can access the backup node remotely over SSH. When the device is deployed independently of the backup node and the master node, it can access both the master node and the backup node remotely over SSH.
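Remote access over SSH can be sketched with the standard `ssh` client invoked from Python. This is an assumption about tooling, not something the text specifies: the host names and remote commands are placeholders, and `BatchMode=yes` is an OpenSSH client option chosen here so that an unattended controller fails instead of waiting for a password prompt.

```python
import subprocess
from typing import List


def build_ssh_command(host: str, remote_command: str) -> List[str]:
    """Build the argument list for running one command on a node over SSH."""
    return ["ssh", "-o", "BatchMode=yes", host, remote_command]


def run_remote(host: str, remote_command: str,
               timeout_seconds: float = 30.0) -> str:
    """Run the command on the node and return its standard output.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    subprocess.TimeoutExpired if the node does not answer in time.
    """
    result = subprocess.run(build_ssh_command(host, remote_command),
                            capture_output=True, text=True,
                            timeout=timeout_seconds, check=True)
    return result.stdout
```

The same helper serves all three placements of the device, since only the set of hosts it connects to changes.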

图5为本发明实施例提供的容灾备份装置的其中一种配置结构图。本发明仅以容灾备份装置独立于主节点、备份节点之外设置为例进行说明，并非对本发明进行限制。如图5所示，主节点用于获取外部源数据，备份节点用于从主节点处复制数据以进行数据同步，容灾备份装置独立于主节点、备份节点之外设置，与备份节点及主节点分别通过网络连接，可以通过SSH对主、备份节点分别进行访问，用于控制备份节点从主节点处复制数据。FIG. 5 is a configuration diagram of a disaster recovery backup device provided by an embodiment of the present invention. The present invention is only illustrated by taking the setting of the disaster recovery backup device independent of the master node and the backup node as an example, and does not limit the present invention. As shown in Figure 5, the master node is used to obtain external source data, and the backup node is used to copy data from the master node for data synchronization. The nodes are respectively connected through the network, and the main and backup nodes can be accessed through SSH, which is used to control the backup node to copy data from the main node.

为了更好地理解本发明实施例提供的容灾备份方法、服务器及系统，下面结合具体的例子进行说明。In order to better understand the disaster recovery backup method, server, and system provided in the embodiments of the present invention, the following description will be made in conjunction with specific examples.

NetApp SnapMirror软件具有经过验证的高效率、精简性及合理成本，因此多年来该软件一直是在各种NetApp存储环境中进行复制和灾难恢复的首选技术。可以利用Network Appliance建立在Data ONTAPP操作系统基础上的NetApp SnapMirror软件来搭建容灾备份系统中的备份节点和主节点，利用容灾备份装置监控并远程控制NetAppSnapMirror软件中的备份节点及主节点，按照图6所示控制结构图实现远程数据同步。在本发明实施例中，分别将NetApp SnapMirror软件部署在北京的两个节点：A节点和B节点分别作为主节点和备份节点。NetApp SnapMirror software's proven efficiency, simplicity, and cost-effectiveness have made it the technology of choice for replication and disaster recovery in a variety of NetApp storage environments for many years. The NetApp SnapMirror software based on the Data ONTAPP operating system of Network Appliance can be used to build the backup node and master node in the disaster recovery backup system, and the disaster recovery backup device can be used to monitor and remotely control the backup node and master node in the NetApp SnapMirror software. The control structure diagram shown in Figure 6 realizes remote data synchronization. In the embodiment of the present invention, the NetApp SnapMirror software is deployed on two nodes in Beijing: node A and node B serve as the master node and the backup node respectively.

First, the relevant parameters of the disaster recovery backup device are checked. If the parameter check passes, the device then checks whether node B is in the data synchronization interrupted (broken_off) state. If node B is not in the broken_off state, the device reports an error and exits without performing the following operations.

If node B is in the broken_off state, the requirement for continuing data synchronization is met, and the device next checks whether the time difference between node A's last data update and the system time is less than 15 minutes. If node A's last data update lags the system time by more than 15 minutes, the device polls node A a further 20 times at 30-second intervals. If, after these ten minutes, node A's last data update is still not within 15 minutes of the current time, the device reports an error and exits.
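The freshness check and its retry loop can be sketched in Python as follows. This is an illustration under stated assumptions, not the embodiment's code: `get_last_update` is a hypothetical callable that returns node A's last data update time, and the clock and sleep functions are injectable so that the loop can be exercised without waiting ten minutes.

```python
import time
from datetime import datetime, timedelta
from typing import Callable

FRESHNESS_THRESHOLD = timedelta(minutes=15)  # first time threshold
MAX_POLLS = 20                               # retries after the first check
POLL_INTERVAL_S = 30                         # seconds between retries


def is_fresh(last_update: datetime, now: datetime,
             threshold: timedelta = FRESHNESS_THRESHOLD) -> bool:
    """True if the last update is less than `threshold` behind `now`."""
    return now - last_update < threshold


def wait_until_master_fresh(
    get_last_update: Callable[[], datetime],   # hypothetical accessor for node A
    now_fn: Callable[[], datetime] = datetime.now,
    sleep_fn: Callable[[float], None] = time.sleep,
    max_polls: int = MAX_POLLS,
    interval_s: float = POLL_INTERVAL_S,
) -> bool:
    """Check node A once, then retry up to max_polls times.

    Returns True as soon as node A is fresh, and False if it is still
    stale after all retries (the caller then reports an error and exits).
    """
    if is_fresh(get_last_update(), now_fn()):
        return True
    for _ in range(max_polls):
        sleep_fn(interval_s)
        if is_fresh(get_last_update(), now_fn()):
            return True
    return False
```

With the default values the retries span 20 × 30 s = 600 s, which matches the ten minutes given in the text.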

If the time difference between node A's last data update and the system time is less than 15 minutes, either at the first check or after node A has been polled repeatedly, node A has obtained the latest external data. The procedure then continues: node B is started and begins synchronizing data from node A (that is, snapmirror replication is started).

Timing starts at the moment node B is started, and the device checks whether the overall time node B has spent on data synchronization is less than 40 minutes. If it exceeds 40 minutes, some module of the system may have hung during data synchronization, and the device reports an error and exits.
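One way to express this overall deadline is a small timer object created when node B is started. The sketch below is illustrative only; the class name `SyncTimer` and the injectable `clock` parameter are assumptions made so that the logic can be checked without waiting, and treating exactly 40 minutes as not yet expired is also an assumption, since the text only says "exceeds".

```python
import time
from typing import Callable

SYNC_TIMEOUT_S = 40 * 60  # third time threshold: 40 minutes, in seconds


class SyncTimer:
    """Tracks how long node B has been synchronizing since it was started."""

    def __init__(self, timeout_s: float = SYNC_TIMEOUT_S,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._timeout_s = timeout_s
        self._started_at = clock()  # timing begins when node B is started

    def elapsed(self) -> float:
        """Seconds elapsed since the timer was created."""
        return self._clock() - self._started_at

    def expired(self) -> bool:
        """True once the overall synchronization time exceeds the timeout."""
        return self.elapsed() > self._timeout_s
```

A monotonic clock is used by default so that adjustments to the wall-clock time on the control host do not shorten or lengthen the deadline.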

If node B's overall data synchronization time does not exceed 40 minutes, the procedure continues by polling the states of node B and node A. The device determines whether node B and node A are in the "snapmirrored&transferring" state (that is, node B is in the data-synchronization-complete (snapmirrored) state and node A is in the data-transfer (transferring) state). If so, it waits 1 minute before checking the states of node A and node B again. Before each such re-check, the device must first determine once more whether node B's overall data synchronization time is still less than 40 minutes, and it re-checks the states only if it is. Node A being in the transferring state means that node A is currently receiving external data, and an attempt to stop node B at this moment might not succeed. To ensure that node B is stopped successfully, this embodiment adds the step of checking node A. This step is a further supplement to the embodiments of the method and device of the present invention and does not limit the present invention.

If node B and node A are not in the "snapmirrored&transferring" state, the device further determines whether they are in the "snapmirrored&idle" state (that is, node B is in the snapmirrored state and node A is in the idle state). Node A being in the idle state means that node A is neither receiving external data nor transmitting data to node B, so stopping node B at this moment will not fail.

If node B and node A are not in the "snapmirrored&idle" state, the device further determines whether they are in the "broken_off&idle" state (that is, node B is in the data synchronization interrupted state and node A is in the idle state). If they are, node B did not start successfully and must be started again to perform data synchronization. In a specific implementation, the further determination may also be made on node B alone: when node B is in the data synchronization interrupted state, node B failed to start and must be started again to synchronize data.
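The state checks described above amount to a small decision table over the pair (state of node B, state of node A). The following Python sketch shows one way to encode it. The `Action` names and the `decide` function are illustrative assumptions; the restart rule uses the simplified variant that looks at node B alone; and the `UNRECOGNIZED` fallback is an assumption, because the embodiment does not say how other state combinations are handled.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait one minute, then poll again"
    SYNC_COMPLETE = "synchronization finished; node B can be stopped"
    RESTART_BACKUP = "node B did not start; start it again"
    UNRECOGNIZED = "state pair not covered by the embodiment"


def decide(backup_state: str, master_state: str) -> Action:
    """Map the states of node B (backup) and node A (master) to an action."""
    if backup_state == "snapmirrored" and master_state == "transferring":
        # Node A is still receiving external data; stopping B might fail.
        return Action.WAIT
    if backup_state == "snapmirrored" and master_state == "idle":
        # Node A is neither receiving nor sending; B can be stopped safely.
        return Action.SYNC_COMPLETE
    if backup_state == "broken_off":
        # Simplified variant: judge node B alone, whatever node A is doing.
        return Action.RESTART_BACKUP
    return Action.UNRECOGNIZED
```

Keeping the decision as a pure function of the two state strings separates it from the SSH calls that fetch the states, so the table can be reviewed and checked on its own.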

If node B and node A are in the "snapmirrored&idle" state, data synchronization has finished. The device then checks whether the time difference between the last data synchronization on node B and the last data update on node A is less than 5 minutes. If it is less than 5 minutes, data synchronization has succeeded; if it is greater than 5 minutes, the device executes the update command so that node B continues to synchronize data from node A. When the difference between the last data synchronization time on node B and the last data update time on node A is less than 5 minutes, node B is generally considered to have successfully synchronized the latest data on node A.
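The final lag check can be written as a single comparison. In the sketch below the function name `needs_update` is hypothetical, and two details are assumptions because the text leaves them open: a difference of exactly 5 minutes is treated as requiring an update, and the absolute difference is used so that the order of the two timestamps does not matter.

```python
from datetime import datetime, timedelta

LAG_THRESHOLD = timedelta(minutes=5)  # second time threshold


def needs_update(backup_last_sync: datetime, master_last_update: datetime,
                 threshold: timedelta = LAG_THRESHOLD) -> bool:
    """True if node B lags node A by the threshold or more.

    False means synchronization is regarded as successful and node B can
    be stopped; True means the update command should be issued so that
    node B keeps synchronizing from node A.
    """
    lag = abs(master_last_update - backup_last_sync)
    return lag >= threshold
```

For example, a last synchronization at 12:00 against a last update at 12:03 counts as successful, while a last update at 12:06 would trigger another update.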

By combining a disaster recovery application server with NetApp SnapMirror software, the embodiments of the present invention enable the management center to control each controlled end node remotely, turn the procedure into an automated workflow, and automatically monitor the data synchronization state of the NetApp SnapMirror software. The result is efficient, automated remote data synchronization in which remote SSH data interaction is controllable and remote commands are executed automatically.

Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing equipment produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Specific embodiments have been used herein to explain the principles and implementations of the present invention. The description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims

1. A disaster recovery backup method, characterized in that the disaster recovery backup method comprises:

determining whether a backup node is in a data synchronization interrupted state;

when the backup node is in the data synchronization interrupted state, determining whether the time difference between the time of the last data update on a master node and the system time is less than a first time threshold;

if the time difference between the time of the last data update on the master node and the system time is less than the first time threshold, starting the backup node to synchronize data from the master node;

polling the backup node and, when the backup node completes data synchronization, stopping the backup node from synchronizing data.

2. The disaster recovery backup method according to claim 1, characterized in that, when the backup node completes data synchronization, if the backup node cannot be stopped, the disaster recovery backup method further comprises: polling the master node for a preset time, and, if the master node stops receiving external data within the preset time, stopping the backup node from synchronizing data.

3. The disaster recovery backup method according to claim 2, characterized in that the disaster recovery backup method further comprises: if the master node does not stop receiving external data within the preset time, issuing alarm information and reporting a disaster recovery error.

4. The disaster recovery backup method according to claim 1 or 2, characterized in that, when the backup node completes data synchronization, the disaster recovery backup method further comprises:

determining whether the time difference between the time of the last data synchronization on the backup node and the time of the last data update on the master node is less than a second time threshold;

when the time difference between the time of the last data synchronization on the backup node and the time of the last data update on the master node is less than the second time threshold, stopping the backup node from synchronizing data.

5. The disaster recovery backup method according to claim 1, characterized in that, after the backup node is started to synchronize data from the master node because the time difference between the time of the last data update on the master node and the system time is less than the first time threshold, the disaster recovery backup method further comprises: determining whether the backup node is still in the data synchronization interrupted state, and, if the backup node is still in the data synchronization interrupted state, restarting the backup node to perform data synchronization.

6. The disaster recovery backup method according to claim 1 or 5, characterized in that the disaster recovery backup method further comprises: starting timing after the backup node is started, and detecting whether the overall time the backup node has so far spent on data synchronization is greater than a third time threshold; if so, issuing alarm information and reporting a disaster recovery error.

7. The disaster recovery backup method according to claim 1, characterized in that the disaster recovery backup method further comprises: when the result of determining whether the backup node is in the data synchronization interrupted state is negative, issuing alarm information and reporting a disaster recovery error.

8. The disaster recovery backup method according to claim 4, characterized in that the disaster recovery backup method further comprises: when the time difference between the time of the last data synchronization on the backup node and the time of the last data update on the master node is greater than the second time threshold, causing the backup node to continue synchronizing data.

9. A disaster recovery backup device, characterized in that the disaster recovery backup device comprises:

a backup node access unit, configured to determine whether a backup node is in a data synchronization interrupted state;

a master node access unit, configured to determine whether the time difference between the time of the last data update on a master node and the system time is less than a first time threshold;

a backup node start unit, configured to start the backup node to synchronize data from the master node when the time difference between the time of the last data update on the master node and the system time is less than the first time threshold;

a polling system unit, configured to poll the backup node and, when the backup node completes data synchronization, to stop the backup node from synchronizing data.

10. The disaster recovery backup device according to claim 9, characterized in that, when the backup node completes data synchronization, if the backup node cannot be stopped, the polling system unit is further configured to: poll the master node for a preset time, and, if the master node stops receiving external data within the preset time, stop the backup node from synchronizing data.

11. The disaster recovery backup device according to claim 10, characterized in that the disaster recovery backup device further comprises: an error reporting unit, configured to issue alarm information and report a disaster recovery error when the master node does not stop receiving external data within the preset time.

12. The disaster recovery backup device according to claim 9 or 10, characterized in that the polling system unit further comprises:

a time difference judgment module, configured to determine whether the time difference between the time of the last data synchronization on the backup node and the time of the last data update on the master node is less than a second time threshold;

a node interruption module, configured to stop the backup node from synchronizing data when the time difference between the time of the last data synchronization on the backup node and the time of the last data update on the master node is less than the second time threshold.

13. The disaster recovery backup device according to claim 9, characterized in that the polling system unit further comprises:

a judgment module, configured to determine, after the backup node start unit starts the backup node, whether the backup node is still in the data synchronization interrupted state;

a restart module, configured to restart the backup node to perform data synchronization when the backup node is still in the data synchronization interrupted state.

14. The disaster recovery backup device according to claim 9 or 13, characterized in that the disaster recovery backup device further comprises a timing unit, configured to start timing after the backup node is started and to detect whether the overall time the backup node has so far spent on data synchronization is greater than a third time threshold; when the overall time the backup node has so far spent on data synchronization is greater than the third time threshold, the error reporting unit issues alarm information and reports a disaster recovery error.

15. The disaster recovery backup device according to claim 11, characterized in that the error reporting unit is further configured to issue alarm information and report a disaster recovery error when the output of the backup node access unit is negative.

16. The disaster recovery backup device according to claim 12, characterized in that the polling system unit further comprises: an update module, configured to cause the backup node to continue synchronizing data when the output of the time difference judgment module is negative.

17. The disaster recovery backup device according to claim 9, characterized in that the disaster recovery backup device is arranged in the backup node and remotely accesses the master node through SSH, where SSH denotes the Secure Shell protocol.

18. The disaster recovery backup device according to claim 9, characterized in that the disaster recovery backup device is arranged in the master node and remotely accesses the backup node through SSH.

19. The disaster recovery backup device according to claim 9, characterized in that the disaster recovery backup device is arranged independently, outside the backup node and the master node.