
CN118689710A - A disaster recovery switching device for automated operation and maintenance system - Google Patents

A disaster recovery switching device for automated operation and maintenance system

Info

Publication number
CN118689710A
Authority
CN
China
Prior art keywords
data center
disaster recovery
disaster
data
switching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411163061.XA
Other languages
Chinese (zh)
Other versions
CN118689710B (en)
Inventor
廖万里
金卓
叶锡建
蔡金萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Kingsware Information Technology Co Ltd
Original Assignee
Zhuhai Kingsware Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Kingsware Information Technology Co Ltd filed Critical Zhuhai Kingsware Information Technology Co Ltd
Priority to CN202411163061.XA priority Critical patent/CN118689710B/en
Publication of CN118689710A publication Critical patent/CN118689710A/en
Application granted granted Critical
Publication of CN118689710B publication Critical patent/CN118689710B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1458Management of the backup or restore process
    • G06F11/1469Backup restoration techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1448Management of the data involved in backup or backup restore
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a disaster recovery switching device for an automated operation and maintenance system. The device uses database restore points to improve data recovery efficiency; it addresses the saving of data, such as components and processes, that is being edited, reducing the possibility of data loss and improving the reliability of disaster recovery switching; and it solves the problem of agent-side connection efficiency, greatly improving the efficiency of disaster recovery switching. The invention's user-defined monitoring configuration improves user experience, flexibility, and extensibility; it overcomes the drawback of traditional approaches that rely on the network to judge whether a disaster has occurred in a data center, reduces misjudgments caused by occasional network fluctuations, and improves the accuracy of disaster judgment.

Description

A disaster recovery switching device for an automated operation and maintenance system

Technical Field

The present invention relates to the field of operation and maintenance systems, and in particular to a disaster recovery switching device for an automated operation and maintenance system.

Background Art

With the rapid development of IT technology, most enterprises now use IT technology to support the normal operation of their business systems, and the architecture of these business systems is becoming increasingly complex, involving more and more operational steps and personnel, especially in finance-related industries. To meet the operation and maintenance management needs of the financial industry, guarantee management quality, and reduce costs, most financial enterprises already use automated operation and maintenance systems to assist with internal operation and maintenance management. When a disaster occurs, ensuring the availability and continuity of the automated operation and maintenance system therefore becomes particularly important.

In this context, automated operation and maintenance systems currently face the following problems when responding to disasters:

1. Manual operation is cumbersome and costly. Because of the complexity of the system architecture, operation and maintenance personnel must be very familiar with it, otherwise they cannot respond to disasters quickly and efficiently, and they must perform a series of operations during a disaster switch, which consumes considerable human resources and cost. Moreover, manual judgment and manual switching are error-prone, which affects system stability.

2. Disaster recovery switching takes a long time. On the one hand, the complexity of the system and the tedium of the manual steps make the switching operation time-consuming. On the other hand, the automated operation and maintenance system holds a large amount of data, and recovering that data after a disaster takes a long time. Furthermore, the system relies on agents to execute business requirements, and when a disaster occurs, handling the agents' connections takes a long time. All of these factors greatly reduce the availability and continuity of the automated operation and maintenance system.

3. The switching strategy is one-dimensional. At present, the disaster recovery switch can only be performed once a disaster has actually occurred, with the user subjectively deciding and manually executing the switch to a particular backup data center to keep the system running. This strategy depends entirely on human judgment, cannot produce an automatic, objective decision, and therefore lacks objectivity.

4. Data consistency is poor. If a disaster occurs while a user is editing components, processes, and so on, the unsaved data being edited is lost and must be re-entered, resulting in poor data consistency and business continuity.

Against this background, some disaster recovery monitoring and switching methods exist, but none of them solves all of the above problems at once, so a disaster recovery switching device for an automated operation and maintenance system is needed to solve the problems described above.

Summary of the Invention

The purpose of the present invention is to overcome the shortcomings and deficiencies of the prior art and to provide a disaster recovery switching device for an automated operation and maintenance system.

The purpose of the present invention is achieved through the following technical solution:

A disaster recovery switching device for an automated operation and maintenance system comprises a data acquisition and analysis module, a disaster recovery plan selection module, a disaster recovery switching execution module, a disaster recovery switching verification module, and a disaster recovery switching data recovery module, wherein:

the data acquisition and analysis module is used to obtain environmental data from the automated operation and maintenance system and to determine, according to preset judgment rules, whether a disaster has occurred in the current data center;

the disaster recovery plan selection module is used to determine which backup data center to switch to when a disaster occurs; if an automatic switching task is configured, the backup data center is obtained from the system configuration, otherwise the user selects the backup data center by clicking;

the disaster recovery switching execution module is used to execute the disaster recovery switch, switching from the primary data center to the backup data center and performing a series of switching operations, including database state transition, database restore point setting, and firewall setting;

the disaster recovery switching verification module is used to execute the verification flow after the switch, checking whether processes, components, and tasks are normal and whether the agent connections are normal, to confirm that the switch was executed correctly and that the backup data center is currently available;

the disaster recovery switching data recovery module is used to judge, based on the data from the data acquisition and analysis module, whether the original primary data center has returned to normal, and if so, to switch back to the production data center automatically or manually.

In the disaster recovery plan selection module, the automatic switching task means that the user configures in advance the disaster recovery connection relationship of each data center, i.e., which backup data center continues running when a disaster occurs in the current data center.

The specific working process of the disaster recovery plan selection module is as follows:

Step 1: the automated operation and maintenance system monitors several indicators of the production environment in real time and computes a status value for the production data center; if the status value exceeds a preset value, the current production data center is judged to be in an abnormal state and the process goes to step 2.

Step 2: the system determines whether an automatic switching task is configured; if so, it automatically obtains the disaster recovery connection relationship of the current data center and performs the switching operation so that the automated operation and maintenance system switches to the backup data center; otherwise, the user subjectively selects the backup data center to switch to and clicks that data center's switch button.

When the disaster recovery switch is actually executed, the time point at which the disaster occurred is first set as the database restore point: the management system of the automated operation and maintenance system first cancels the recovery operation of the managed database currently in progress, then creates a restore point and designates the database to be restored, and finally disconnects the primary database from the standby database. This improves the efficiency of data recovery after the disaster is over, because recovery starts only from the restore point and the execution data produced during the disaster period is restored incrementally to the production database, greatly shortening the data recovery time.

The management system of the automated operation and maintenance system then enables the server's "quick connection" setting, which lets agents connect to the server quickly and ensures that business can continue to run through the agents. Next, the firewall of the operating system and server is enabled to intercept all requests sent to the production data center and return an interception notice to the sender. The recovery operation of the currently managed database is then canceled and the backup data center is activated, so that its database becomes readable and writable, preparing it to run business workloads later, and operational data is written to the standby database.

In step 1, the status value Q of the production data center is calculated as follows:

If the system administrator has no special requirements for the five monitored indicators (memory usage, network latency, CPU usage, thread-count ratio, and number of online servers), the algorithm built into the automated operation and maintenance system is used by default to judge the status of the production data center:

Q = 0.18x + 0.3y + 0.15z + 0.12i + 0.25j;

where the memory usage is x, the network latency is y, the CPU usage is z, the thread-count ratio is i, and the number of online servers is j.

If the system administrator does have special requirements for the five monitored indicators, the administrator can customize the indicator thresholds, the occurrence counts, and the indicator weights, from which a new status value Q of the production data center is calculated.

The agent of the automated operation and maintenance system starts in "quick connection" mode: it first ensures that the agent can connect to the server quickly and stably, and then handles the agent's connection checks asynchronously.

Assuming the database of the automated operation and maintenance system is created at time t0, then while no disaster occurs, data is synchronized in real time from the primary data center to the backup data center and the two data centers stay in sync. When a disaster occurs at time t2, the primary data center stops synchronizing data to the backup data center and t2 is taken as the database restore point. During the disaster period from t2 to t6, the automated operation and maintenance system has switched to the backup data center, and all operational data produced is stored there. When the disaster is resolved at time t6, the operational data from t2 to t6 is restored to the primary data center, after which data continues to be synchronized in real time from the primary data center to the backup data center.

Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. The present invention uses database restore points to improve data recovery efficiency.

2. The present invention addresses the saving of data, such as components and processes, that is being edited, reducing the possibility of data loss and improving the reliability of disaster recovery switching.

3. The present invention solves the problem of agent connection efficiency, greatly improving the efficiency of disaster recovery switching.

4. The invention's user-defined monitoring configuration improves user experience, flexibility, and extensibility; it also overcomes the drawback of traditional approaches that rely on the network to judge whether a data center has suffered a disaster, reduces misjudgments caused by occasional network fluctuations, and improves the accuracy of disaster judgment.

Brief Description of the Drawings

FIG. 1 is an architecture diagram of the automated operation and maintenance system of the present invention.

FIG. 2 is a multi-data-center structure diagram of the automated operation and maintenance system of the present invention.

FIG. 3 is a schematic structural diagram of the disaster recovery switching device for the automated operation and maintenance system of the present invention.

FIG. 4 is a schematic diagram of data recovery to the primary data center according to the present invention.

FIG. 5 is a flow chart of the disaster recovery switching according to the present invention.

Detailed Description

The present invention is further described in detail below in conjunction with embodiments and drawings, but the embodiments of the present invention are not limited thereto.

1. Disaster recovery architecture of the automated operation and maintenance system.

As shown in FIG. 1, the automated operation and maintenance system consists of three parts: the server (Server), the console (Control), and the agent (Agent). The server provides the automated operation and maintenance services and related functions; it is deployed as a cluster and uses a redis cluster to improve read/write capacity. The database can be of several types, such as mysql, oracle, or Gauss; this description takes oracle as an example and adopts real-time synchronization between a primary database and a standby database. The console connects to the server and is the user's operation interface, where components, processes, tasks, and so on are edited. The agent connects to the server and is mainly responsible for executing tasks and processes; only when the agent is stably connected to the server can tasks and processes execute normally.

As shown in FIG. 2, the automated operation and maintenance system can deploy one production data center (the primary data center) and multiple backup data centers, and the production data center and each backup data center can serve as disaster recovery partners for one another. Under normal circumstances, the system runs in the production data center, the other data centers serve only as standby runtime environments, and data is synchronized in real time from the production data center to each backup data center.

2. Disaster event judgment model.

The present invention establishes a disaster event judgment model. Every 5 seconds, the automated operation and maintenance system obtains and evaluates five indicators of the production data center: memory usage, network latency, CPU usage, thread-count ratio, and number of online servers. Combining the occurrence counts and weights of these indicators, it determines whether the production data center is currently in an abnormal state; if so, the switching operation is executed, otherwise the status quo is maintained.

Embodiment 1: if the system administrator has no special requirements for the five monitored indicators, the algorithm built into the system is used by default to judge the status of the production data center, as follows.

For memory usage: when the usage is within [0, 60%] and has occurred at most 5 times, the alarm level is normal; within [0, 60%] with more than 5 occurrences, general; within (60%, 80%] with at most 5 occurrences, general; within (60%, 80%] with more than 5 occurrences, minor; within (80%, 100%] with at most 5 occurrences, minor; within (80%, 100%] with more than 5 occurrences, severe.

Table 1. Memory usage alarm levels

For network latency: when the latency is within [0 ms, 50 ms] and has occurred at most 10 times, the alarm level is normal; within [0 ms, 50 ms] with more than 10 occurrences, general; within (50 ms, 100 ms] with at most 10 occurrences, general; within (50 ms, 100 ms] with more than 10 occurrences, minor; within (100 ms, +∞) with at most 10 occurrences, minor; within (100 ms, +∞) with more than 10 occurrences, severe.

Table 2. Network latency alarm levels

For CPU usage: when the usage is within [0, 60%] and has occurred at most 8 times, the alarm level is normal; within [0, 60%] with more than 8 occurrences, general; within (60%, 80%] with at most 8 occurrences, general; within (60%, 80%] with more than 8 occurrences, minor; within (80%, 100%] with at most 8 occurrences, minor; within (80%, 100%] with more than 8 occurrences, severe.

Table 3. CPU usage alarm levels

For the thread-count ratio: when the ratio is within [0, 60%] and has occurred at most 8 times, the alarm level is normal; within [0, 60%] with more than 8 occurrences, general; within (60%, 80%] with at most 8 occurrences, general; within (60%, 80%] with more than 8 occurrences, minor; within (80%, 100%] with at most 8 occurrences, minor; within (80%, 100%] with more than 8 occurrences, severe.

Table 4. Thread-count ratio alarm levels

For the number of online servers: when the number is non-zero and has occurred at least 2 times, the alarm level is normal; when the number is non-zero and has occurred fewer than 2 times, general; when the number is zero and has occurred at least 2 times, severe; when the number is zero and has occurred fewer than 2 times, minor.

Table 5. Online-server-count alarm levels
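The banded rules in Tables 1 through 4 share one shape: the indicator's band and its occurrence count together decide the alarm level. The Python sketch below illustrates that rule only; the band boundaries and occurrence limits are the defaults stated above, while the function name `alarm_level` and its argument layout are assumptions, not part of the patent. The online-server-count indicator follows its own two-state rule (Table 5) and is not covered by this helper.

```python
# Illustrative sketch of the default banded alarm rules (Tables 1-4), assuming the
# level ladder normal < general < minor < severe described in the text.

def alarm_level(value, band_uppers, occurrences, max_occurrences):
    """Classify one indicator reading.

    value           -- current reading, e.g. 0.85 for 85% memory usage or 80 for 80 ms latency
    band_uppers     -- ascending upper bounds of the bands, e.g. [0.60, 0.80, 1.00]
    occurrences     -- how many times this reading has occurred recently
    max_occurrences -- occurrence limit (5 for memory, 10 for latency, 8 for CPU/threads)
    """
    band = next(i for i, upper in enumerate(band_uppers) if value <= upper)
    frequent = occurrences > max_occurrences
    ladder = ["normal", "general", "minor", "severe"]
    # Each higher band, plus the "too frequent" flag, pushes the level one step up.
    return ladder[min(band + (1 if frequent else 0), len(ladder) - 1)]

# Memory at 85%, seen 3 times (<= 5): top band, not frequent -> "minor", matching Table 1.
print(alarm_level(0.85, [0.60, 0.80, 1.00], occurrences=3, max_occurrences=5))
# Latency at 120 ms, seen 12 times (> 10): top band and frequent -> "severe", matching Table 2.
print(alarm_level(120, [50, 100, float("inf")], occurrences=12, max_occurrences=10))
```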

Assume the memory usage is x, the network latency is y, the CPU usage is z, the thread-count ratio is i, and the number of online servers is j, with weights of 0.18, 0.3, 0.15, 0.12, and 0.25 respectively (the weights sum to 1). The status model Q of the production data center is then:

Q = 0.18x + 0.3y + 0.15z + 0.12i + 0.25j;

Every 5 seconds, the automated operation and maintenance system obtains the data for the five indicators of the production data center (memory usage, network latency, CPU usage, thread-count ratio, and number of online servers) and, combined with the occurrence counts, determines each indicator's alarm level. If an indicator's alarm level is severe, the corresponding variable takes the value 4; if minor, 3; if general, 2; if normal, 1. These values are substituted into Q to obtain the status value of the production data center; if Q is greater than 3.6, the production data center is judged to be abnormal and a disaster recovery switch is required. For example, suppose that under the rules above the memory usage alarm level is severe, the network latency level is minor, the CPU usage level is general, the thread-count ratio level is normal, and the online-server-count level is normal; then Q = 0.18×4 + 0.3×3 + 0.15×2 + 0.12×1 + 0.25×1 = 2.29, which is less than 3.6, so the data center status is normal and no disaster recovery switch is needed.
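As a concrete illustration of the status model, the sketch below maps alarm levels to the values 4/3/2/1, computes Q with the default weights, and compares it against the 3.6 threshold, reproducing the worked example above. The function and dictionary names are assumptions made for this illustration; only the weights, the level-to-value mapping, and the threshold come from the text.

```python
# Default weights for the five indicators and the level-to-value mapping from the text.
DEFAULT_WEIGHTS = {"memory": 0.18, "latency": 0.30, "cpu": 0.15,
                   "threads": 0.12, "servers": 0.25}
LEVEL_VALUE = {"severe": 4, "minor": 3, "general": 2, "normal": 1}

def status_value(levels, weights=DEFAULT_WEIGHTS):
    """Q = sum over the five indicators of weight * numeric alarm level."""
    return sum(weights[k] * LEVEL_VALUE[levels[k]] for k in weights)

def needs_failover(levels, weights=DEFAULT_WEIGHTS, threshold=3.6):
    """The production data center is abnormal when Q exceeds the preset threshold."""
    return status_value(levels, weights) > threshold

# The worked example above: Q = 0.18*4 + 0.3*3 + 0.15*2 + 0.12*1 + 0.25*1 ≈ 2.29 < 3.6.
levels = {"memory": "severe", "latency": "minor", "cpu": "general",
          "threads": "normal", "servers": "normal"}
print(round(status_value(levels), 2))   # 2.29
print(needs_failover(levels))           # False -> no disaster recovery switch needed
```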

Embodiment 2: if the system administrator has special requirements for the five monitored indicators, the administrator can customize the indicator thresholds, occurrence counts, and indicator weights. Assume the administrator's configuration is as shown in the following table.

Table 6. Custom indicator configuration

Based on the thresholds and occurrence counts customized by the administrator, the rules by which the automated operation and maintenance system determines the event alarm level are as follows.

For memory usage: when the usage is within [0, 70%] and has occurred at most 5 times, the alarm level is normal; within [0, 70%] with more than 5 occurrences, general; within (70%, 100%] with at most 5 occurrences, minor; within (70%, 100%] with more than 5 occurrences, severe.

Table 7. Memory usage alarm levels

For network latency: when the latency is within [0 ms, 50 ms] and has occurred at most 10 times, the alarm level is normal; within [0 ms, 50 ms] with more than 10 occurrences, general; within (50 ms, +∞) with at most 10 occurrences, minor; within (50 ms, +∞) with more than 10 occurrences, severe.

Table 8. Network latency alarm levels

For CPU usage: when the usage is within [0, 65%] and has occurred at most 5 times, the alarm level is normal; within [0, 65%] with more than 5 occurrences, general; within (65%, 100%] with at most 5 occurrences, minor; within (65%, 100%] with more than 5 occurrences, severe.

Table 9. CPU usage alarm levels

For the thread-count ratio: when the ratio is within [0, 68%] and has occurred at most 5 times, the alarm level is normal; within [0, 68%] with more than 5 occurrences, general; within (68%, 100%] with at most 5 occurrences, minor; within (68%, 100%] with more than 5 occurrences, severe.

Table 10. Thread-count ratio alarm levels

For the number of online servers: when the number is non-zero and has occurred at least 2 times, the alarm level is normal; when the number is non-zero and has occurred fewer than 2 times, general; when the number is zero and has occurred at least 2 times, severe; when the number is zero and has occurred fewer than 2 times, minor.

Table 11. Online-server-count alarm levels

Combined with the five indicator weights customized by the system administrator, the production data center status model Q becomes:

Q = 0.2x + 0.3y + 0.1z + 0.1i + 0.3j;

Every 5 seconds, the automated operation and maintenance system obtains and calculates the status value of the production data center in the same way as described above; if Q is greater than 3.6, the production data center is judged to be abnormal and a disaster recovery switch is required.
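Under the same scheme, the administrator-defined weights of Embodiment 2 simply replace the defaults. The sketch below reuses the `status_value` helper from the earlier sketch; the configuration dictionary and the sample readings are hypothetical, with only the weights (0.2, 0.3, 0.1, 0.1, 0.3) and the 3.6 threshold taken from the formula above.

```python
# Hypothetical administrator-defined weights for Embodiment 2 (from the formula above);
# the thresholds and occurrence limits of Table 6 would be fed to alarm_level() the same way.
ADMIN_WEIGHTS = {"memory": 0.20, "latency": 0.30, "cpu": 0.10,
                 "threads": 0.10, "servers": 0.30}

sample_levels = {"memory": "general", "latency": "severe", "cpu": "normal",
                 "threads": "normal", "servers": "normal"}
q = status_value(sample_levels, weights=ADMIN_WEIGHTS)  # 0.2*2 + 0.3*4 + 0.1 + 0.1 + 0.3 = 2.1
print(q, q > 3.6)  # 2.1 False -> still below the switching threshold
```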

This judgment mode combines indicator occurrence counts and weights, making explicit the administrator's priorities and focus across indicators; it overcomes the drawback of traditional approaches that rely on the network alone to judge whether a data center has suffered a disaster, improves the accuracy of disaster judgment, and optimizes the decision process. At the same time, the custom-configured monitoring scheme improves user experience, flexibility, and extensibility.

3. Disaster switching model

As shown in FIG. 3, the disaster recovery switching device for the automated operation and maintenance system is applicable to monitoring the switching conditions and to switching the system between the primary and backup data centers. The device comprises: a data acquisition and analysis module, a disaster recovery plan selection module, a disaster recovery switching execution module, a disaster recovery switching verification module, and a disaster recovery switching data recovery module.

Data acquisition and analysis module: obtains environmental data from the automated operation and maintenance system and determines, according to the judgment rules, whether a disaster has occurred in the current data center.

Disaster recovery plan selection module: determines which backup data center to switch to when a disaster occurs. If an automatic switching task is configured, the backup data center is obtained from the system configuration; otherwise, the user selects the backup data center by clicking.

Disaster recovery switching execution module: executes the disaster recovery switch, switching from the primary data center to the backup data center and performing a series of switching operations, including database state transition, database restore point setting, firewall setting, and so on.

Disaster recovery switching verification module: executes the verification flow after the switch, checking whether processes, components, tasks, and so on are normal and whether the agent connections are normal, to confirm that the switch was executed correctly and that the backup data center is currently available.

Disaster recovery switching data recovery module: judges, based on the data from the data acquisition and analysis module, whether the original primary data center has returned to normal, and if so, switches back to the production data center automatically or manually.

When a disaster occurs, a series of operations such as switching data centers must be executed so that the processes and tasks of the automated operation and maintenance system can run normally in the backup data center. Two switching modes are provided: manual switching and automatic task switching. With automatic task switching, the user configures in advance the disaster recovery connection relationship of each data center, i.e., which backup data center continues running when the current data center suffers a disaster. With manual switching, when a disaster actually occurs, the user subjectively decides which backup data center to switch to and clicks that data center's switch button. As shown in FIG. 5, the specific switching steps are as follows:

Step 1: the management system monitors in real time the five production-environment indicators (memory usage, network latency, CPU usage, thread-count ratio, and number of online servers) and computes the production data center's status value Q using the algorithm described above; if Q is greater than 3.6, the current production data center is judged to be abnormal.

Step 2: when the monitoring data from step 1 indicate that the current environment is abnormal, the system checks whether an automatic switching task is configured. If so, it automatically obtains the disaster recovery connection relationship of the current data center and executes the switching operation so that the automated operation and maintenance system switches to the backup data center; otherwise, the user subjectively decides which backup data center to switch to and clicks that data center's switch button.

When the disaster recovery switch is actually executed, the time point at which the disaster occurred is first set as the database restore point: the management system first cancels the recovery operation of the managed database currently in progress, then creates a restore point and designates the database to be restored, and finally disconnects the primary database from the standby database. This improves the efficiency of data recovery after the disaster is over, because recovery starts only from the restore point and the execution data produced during the disaster period is restored incrementally to the production database, greatly shortening the data recovery time.

The management system then enables the server's "quick connection" setting, which lets agents connect to the server quickly and ensures that business can continue to run through the agents. Next, the firewall of the operating system and server is enabled to intercept all requests sent to the production data center and return an interception notice to the sender. The recovery operation of the currently managed database is then canceled and the backup data center is activated, so that its database becomes readable and writable, preparing it to run business workloads later, and operational data is written to the standby database.
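For the database-side steps of this sequence, a simplified sketch against an Oracle primary/standby pair (the example database of the architecture section) is shown below. It is an illustration under assumptions, not the patent's implementation: the `oracledb` connection details, the restore-point name, and the `firewall-cmd --panic-on` call are placeholders, and the interception response, "quick connection" toggle, and error handling are omitted.

```python
"""Simplified failover sketch; connection details and firewall rule are assumed placeholders."""
import subprocess
import oracledb  # python-oracledb driver

def fail_over_to_standby(standby_dsn, user, password, restore_point="rp_disaster"):
    conn = oracledb.connect(user=user, password=password, dsn=standby_dsn)
    cur = conn.cursor()
    # 1. Cancel the managed recovery that keeps the standby applying redo from the primary.
    cur.execute("ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL")
    # 2. Mark the disaster moment so later recovery only has to replay data from this point.
    cur.execute(f"CREATE RESTORE POINT {restore_point} GUARANTEE FLASHBACK DATABASE")
    # 3. Break the primary/standby link and open the standby database read-write.
    cur.execute("ALTER DATABASE ACTIVATE PHYSICAL STANDBY DATABASE")
    conn.close()

def block_requests_to_production():
    # Placeholder for the firewall step: reject traffic aimed at the production site.
    # The actual rule set and the interception notice returned to callers are site-specific.
    subprocess.run(["firewall-cmd", "--panic-on"], check=True)
```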

This embodiment also addresses the saving of data, such as components and processes, that is being edited. For components, processes, indicators, and similar data, even if the user has not clicked the save button, the server saves the data to a temporary db file every 5 minutes, and this file is synchronized, along with the database, to each backup data center. When the user clicks save, the temporary file is automatically removed to free storage space. If a disaster occurs while the user is editing, then once the disaster recovery switch completes, the backup data center automatically reads the data from the temporary db file for use, reducing the possibility of data loss and improving the reliability of the disaster recovery switch.
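The 5-minute autosave of in-progress edits can be pictured as a background timer that flushes the current draft to a temporary file and deletes that file once the user saves for real. The sketch below is an assumed illustration: the file name, JSON serialization, and threading approach are not specified by the patent, which only fixes the 5-minute interval, the synchronization of the file to the backup data centers, and the cleanup on save.

```python
import json
import os
import threading

class EditAutosaver:
    """Periodically snapshot unsaved component/process edits to a temporary db file."""

    def __init__(self, get_draft, path="draft_autosave.db.json", interval=300):
        self._get_draft = get_draft  # callable returning the current unsaved edits
        self._path = path            # temporary file, replicated to the backup data centers
        self._interval = interval    # 300 s = the 5-minute interval from the embodiment
        self._timer = None

    def start(self):
        self._timer = threading.Timer(self._interval, self._snapshot)
        self._timer.daemon = True
        self._timer.start()

    def _snapshot(self):
        with open(self._path, "w", encoding="utf-8") as f:
            json.dump(self._get_draft(), f)
        self.start()  # schedule the next snapshot

    def on_user_save(self):
        # Once the user explicitly saves, the temporary file is no longer needed.
        if self._timer:
            self._timer.cancel()
        if os.path.exists(self._path):
            os.remove(self._path)
```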

This embodiment also solves the problem of agent connection efficiency. Since agents are the key to task execution, the backup data center must handle the connections of tens of thousands of agents; in a traditional disaster recovery switch, this flood of agent connections can make the server unresponsive or even cause it to restart. In this method, agents start in "quick connection" mode: the agent first connects to the server quickly and stably, and its connection checks are then handled asynchronously. With this method, connecting 20,000 agents takes 1 to 2 minutes instead of the previous 15 to 20 minutes, greatly improving the efficiency of the disaster recovery switch. In addition, this scheme resolves console lag caused by the large number of agent connections: by configuring the number of threads the system reserves, the console keeps operating normally.

On a normal startup, an agent's indicators, tasks, and other information must be checked first, and only after these checks pass does the agent connect to the server; this check process is time-consuming. With "quick connection" enabled, the agent connects to the server first and the checks begin later, when the server is free.
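The connect-first, check-later ordering can be illustrated with a small asyncio sketch: each agent registers its connection immediately, and the expensive startup checks are queued and drained when the server has capacity. Everything below (the stub server, method names, and queue) is an assumed illustration of that ordering, not the system's actual protocol.

```python
import asyncio

class StubServer:
    """Minimal stand-in for the real server; the check methods are assumptions."""
    async def register(self, agent_id):
        print(f"{agent_id} connected")
    async def verify_indicators(self, agent_id):
        await asyncio.sleep(0.1)  # pretend the per-agent check is slow
    async def verify_tasks(self, agent_id):
        await asyncio.sleep(0.1)

async def quick_connect(agent_id, server, check_queue):
    await server.register(agent_id)   # 1. connect first so work can flow immediately
    await check_queue.put(agent_id)   # 2. defer the expensive startup checks

async def check_worker(server, check_queue):
    while True:  # drained asynchronously, whenever the server has spare capacity
        agent_id = await check_queue.get()
        await server.verify_indicators(agent_id)
        await server.verify_tasks(agent_id)
        check_queue.task_done()

async def main():
    server, queue = StubServer(), asyncio.Queue()
    worker = asyncio.create_task(check_worker(server, queue))
    await asyncio.gather(*(quick_connect(f"agent-{i}", server, queue) for i in range(5)))
    await queue.join()  # all deferred checks have completed
    worker.cancel()

asyncio.run(main())
```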

Step 3: after the automated operation and maintenance system has switched to the backup data center, the corresponding verification flow must be run to check whether key components, processes, tasks, and so on exist and whether the agents are online. If the verification flow executes normally, the disaster recovery switch has succeeded; otherwise it has failed, and a notification is sent to the corresponding administrator.

Step 4: the management system continues to monitor in real time whether the production environment has returned to normal. When the abnormality conditions are no longer met, the production environment is judged to have recovered, and the system's switching mode is obtained. If the mode is manual switching, the administrator clicks the production data center to switch back to; otherwise the system automatically obtains the pre-configured disaster recovery switching plan. When the switch back is actually executed, the production data center is first activated so that its database becomes readable and writable. The system's firewall settings are then turned off so that all requests can again be sent to the production data center. Next, starting from the restore point, all data generated after the restore point is restored to the production database, and once the data recovery is complete, the database restore point is cleared.
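The switch back to production mirrors the failover sketch given earlier: reopen the production database, lift the firewall block, replay the data produced after the restore point, and drop the restore point. The sketch below shares that sketch's assumptions; `replay_changes_since_restore_point` is a hypothetical helper standing in for the incremental restore, whose actual mechanism the text does not describe.

```python
import subprocess
import oracledb  # same placeholder connection assumptions as the failover sketch

def replay_changes_since_restore_point(conn, restore_point):
    # Hypothetical stand-in: copy the data produced in the backup site after the
    # restore point back into the production database. Mechanism not specified here.
    pass

def fail_back_to_production(prod_dsn, user, password, restore_point="rp_disaster"):
    conn = oracledb.connect(user=user, password=password, dsn=prod_dsn)
    cur = conn.cursor()
    # 1. Reopen the production database for reads and writes.
    cur.execute("ALTER DATABASE OPEN")
    # 2. Lift the firewall block so requests reach production again.
    subprocess.run(["firewall-cmd", "--panic-off"], check=True)
    # 3. Replay everything produced after the restore point (incremental restore).
    replay_changes_since_restore_point(conn, restore_point)
    # 4. The restore point has served its purpose and can be dropped.
    cur.execute(f"DROP RESTORE POINT {restore_point}")
    conn.close()
```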

As shown in FIG. 4, this embodiment uses the database restore point to solve the data recovery efficiency problem. Assuming the database is created at time t0, then while no disaster occurs, data is synchronized in real time from the primary data center to the backup data center and the two data centers stay in sync. When a disaster occurs at time t2, the primary data center stops synchronizing data to the backup data center, and t2 is taken as the database restore point. During the disaster period from t2 to t6, the automated operation and maintenance system has switched to the backup data center, and all operational data produced is stored there. When the disaster is resolved at time t6, the operational data from t2 to t6 is restored to the primary data center, after which data continues to be synchronized in real time from the primary data center to the backup data center. In this way, the efficiency of restoring data to the primary data center is greatly improved.

In this embodiment, the problems of agent connection efficiency and data recovery efficiency are both solved. If an automatic switching task is configured, the system can assess the environment on its own and switch automatically, providing users with a smooth, seamless, and natural experience.

Step 5: after the automated operation and maintenance system has switched back to the production data center, the corresponding verification flow must be run to check whether key components, processes, tasks, and so on exist and whether the agents are online. If the verification flow executes normally, the switch has succeeded; otherwise it has failed, and a notification is sent to the corresponding administrator.

The above embodiments are preferred implementations of the present invention, but the implementations of the present invention are not limited to them; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principles of the present invention shall be regarded as an equivalent replacement and falls within the protection scope of the present invention.

Claims (6)

1. A disaster recovery switching device for an automatic operation and maintenance system is characterized in that: the disaster recovery system comprises a data acquisition and analysis module, a disaster recovery scheme selection module, a disaster recovery switching execution module, a disaster recovery switching verification module and a disaster recovery switching data recovery module; wherein:
The data acquisition and analysis module is used for acquiring the environmental data of the automatic operation and maintenance system and judging whether the disaster occurs in the current data center according to a preset judgment rule;
The disaster recovery scheme selection module is used for determining to which backup data center to switch when a disaster occurs; when the automatic switching task is configured, acquiring a standby data center from the configuration of the automatic operation and maintenance system, otherwise, determining the standby data center by a user in a clicking mode;
The disaster recovery switching execution module is used for executing disaster recovery switching, switching the data center from the main data center to the standby data center, and executing a series of switching operations, wherein the switching operations comprise database state transition, database restore point setting and firewall setting;
The disaster recovery switching verification module is used for executing verification flow after disaster recovery switching, checking whether the flow, the components and the tasks are normal or not and whether the connection of the proxy end is normal or not, and confirming that the disaster recovery switching is executed correctly and that the standby data center is in an available state currently;
and the disaster recovery switching data recovery module is used for judging whether the original main data center is recovered to be normal according to the data of the data acquisition and analysis module, and automatically or manually switching back to the production data center if the original main data center is recovered to be normal.
2. The disaster recovery switching device for an automated operation and maintenance system of claim 1, wherein: in the disaster recovery scheme selection module, the automatic switching task is that a user configures the disaster recovery connection relation of each data center in advance, namely, when the current data center has a disaster, which standby data center is used for continuous operation.
3. The disaster recovery switching device for an automated operation and maintenance system of claim 1, wherein: the specific working process of the disaster recovery scheme selection module is as follows:
Step 1, an automatic operation and maintenance system monitors a plurality of indexes of a production environment in real time, calculates a state value of a production data center, judges that the current production data center is in an abnormal state if the state value of the production data center is larger than a preset value, and transfers to the step 2;
Step 2, judging whether the automatic operation and maintenance system is configured with an automatic switching task, if so, automatically acquiring disaster recovery connection relation of the current data center by the system, executing disaster recovery switching operation to enable the automatic operation and maintenance system to switch to the standby data center, otherwise, subjectively selecting which standby data center needs to be switched to by a user, and clicking a switching button of the data center;
When disaster backup switching is actually executed, firstly, a time point at which a disaster occurs is set as a database restoration point, wherein the setting mode is that a management system of an automatic operation and maintenance system firstly cancels the restoration operation of a managed database which is currently in progress, then creates a restoration point and designates a restoration database, finally disconnects the connection operation between a main database and a backup database and is used for improving the data restoration efficiency after disaster restoration, the data restoration is only started from the restoration point, the execution data in a disaster time period is restored to a production database in an incremental mode, and the data restoration time is greatly shortened;
Then the management system of the automatic operation and maintenance system changes the quick connection of the server to be set as an enabling state, and is used for the quick connection of the proxy end to the server, so that the normal execution of the service can be ensured by using the proxy; then starting firewall configuration of an operating system and a server, intercepting all requests sent to a production data center, and returning interception information to a sender of the requests; then, the recovery operation of the currently-ongoing managed database is canceled, the standby data center is activated, so that the database of the standby data center becomes a readable and writable state, a preparation is made for running the service in the standby data center at a later stage, and running data is written into the standby database.
4. The disaster recovery switching device for an automated operation and maintenance system of claim 1, wherein: in step 1, the state value Q of the production data center is calculated as follows:
If the system administrator has no special requirements on the five monitoring indexes, namely the memory occupancy ratio, the network delay, the CPU occupancy ratio, the thread count ratio and the number of online servers, the state of the production data center is judged by default using the algorithm of the automatic operation and maintenance system:
Q=0.18x+0.3y+0.15z+0.12i+0.25j;
wherein x is the memory occupancy ratio, y is the network delay, z is the CPU occupancy ratio, i is the thread count ratio, and j is the number of online servers;
if the system administrator has special requirements on these five monitoring indexes, the administrator can configure custom index critical values, index occurrence counts and index weights, from which a new state value Q of the production data center is calculated.
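A minimal sketch of this calculation is given below, assuming all five index values have already been normalised to comparable scales; the dictionary keys and the custom-weight mechanism are illustrative only.

def data_center_state(metrics, weights=None):
    # Default weights reproduce Q = 0.18x + 0.3y + 0.15z + 0.12i + 0.25j;
    # an administrator with special requirements may pass custom weights.
    default = {"memory": 0.18, "network_delay": 0.30, "cpu": 0.15,
               "threads": 0.12, "servers_online": 0.25}
    w = weights or default
    return sum(w[k] * metrics[k] for k in w)

q = data_center_state({"memory": 0.7, "network_delay": 0.2, "cpu": 0.6,
                       "threads": 0.5, "servers_online": 0.1})
print("Q =", q)  # the switch is triggered when Q exceeds the preset value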
5. The disaster recovery switching device for an automated operation and maintenance system of claim 1, wherein: the proxy end of the automatic operation and maintenance system is started in a "quick connection" mode, that is, the proxy end is first connected to the server quickly and stably, and the connection check of the proxy end is then handled asynchronously.
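One way to realise this "quick connection" behaviour is to establish the connection first and defer the connection check to a background task, as in the asyncio sketch below; the probe delay and the function names are assumptions made for illustration.

import asyncio

async def quick_connect(agent_id, server):
    # Connect the proxy end to the server first so that services can keep
    # executing through the proxy without waiting for verification.
    print(f"{agent_id} connected to {server}")

async def check_connection(agent_id, server):
    # The connection check runs asynchronously and does not block switching.
    await asyncio.sleep(0.1)  # stand-in for a real health probe
    print(f"{agent_id} connection to {server} verified")

async def main():
    await quick_connect("agent-01", "standby-server")
    task = asyncio.create_task(check_connection("agent-01", "standby-server"))
    await task  # the demo waits here only so the background check can print

asyncio.run(main())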
6. The disaster recovery switching device for an automated operation and maintenance system of claim 1, wherein: assuming the database of the automatic operation and maintenance system is created at time point t0, data is synchronized from the main data center to the standby data center in real time while no disaster occurs, keeping the data of the two data centers consistent; when a disaster occurs at time point t2, the main data center no longer synchronizes data to the standby data center, and time point t2 is taken as the database restore point; during the period from t2 to t6, the automatic operation and maintenance system is switched to the standby data center, and all running data generated in that period resides in the standby data center; when the main data center returns to normal at time point t6, the running data of the period from t2 to t6 is restored to the main data center, after which data is again synchronized from the main data center to the standby data center in real time.
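The incremental restore from the t2 restore point can be pictured as replaying only the operations recorded in the standby data center between t2 and t6, as in the sketch below; the event log and timestamps are invented for illustration.

# Hypothetical operations executed in the standby data center between the
# disaster (t2) and the recovery of the main data center (t6).
standby_log = [
    {"ts": 3, "op": "insert", "row": "job-101"},
    {"ts": 4, "op": "update", "row": "job-101"},
    {"ts": 5, "op": "insert", "row": "job-102"},
]

def incremental_restore(log, restore_point, recovery_time):
    # Only operations after the restore point (t2) and up to the recovery
    # time (t6) are replayed against the main database, which is what keeps
    # the data recovery window short.
    return [e for e in log if restore_point < e["ts"] <= recovery_time]

for event in incremental_restore(standby_log, restore_point=2, recovery_time=6):
    print("replay on main data center:", event)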
CN202411163061.XA 2024-08-23 2024-08-23 A disaster recovery switching device for automated operation and maintenance system Active CN118689710B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411163061.XA CN118689710B (en) 2024-08-23 2024-08-23 A disaster recovery switching device for automated operation and maintenance system


Publications (2)

Publication Number Publication Date
CN118689710A 2024-09-24
CN118689710B 2025-03-18

Family

ID=92764504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411163061.XA Active CN118689710B (en) 2024-08-23 2024-08-23 A disaster recovery switching device for automated operation and maintenance system

Country Status (1)

Country Link
CN (1) CN118689710B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6092085A (en) * 1998-03-24 2000-07-18 International Business Machines Corporation Method and system for improved database disaster recovery
KR20150139278A (en) * 2014-06-03 2015-12-11 주식회사 에스이든 Disaster Recovery Automation System
CN117312052A (en) * 2023-10-08 2023-12-29 北京人大金仓信息技术股份有限公司 Recovery method, storage medium and equipment of database disaster recovery system
CN117640648A (en) * 2023-11-30 2024-03-01 中国建设银行股份有限公司 Disaster recovery processing method and device for server
CN117931523A (en) * 2024-01-24 2024-04-26 中国农业银行股份有限公司 Disaster recovery switching method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
岳阳 (Yue Yang) et al.: "Research on a disaster recovery management system framework based on the construction of a railway master data center", 《计算机应用》 (Journal of Computer Applications), 22 July 2020 (2020-07-22), pages 53-56 *

Also Published As

Publication number Publication date
CN118689710B (en) 2025-03-18

Similar Documents

Publication Publication Date Title
CN111597079B (en) Method and system for detecting and recovering MySQL Galera cluster faults
US7392421B1 (en) Framework for managing clustering and replication
CN112787855B (en) Main/standby management system and management method for wide-area distributed service
CN106933693A (en) A kind of data-base cluster node failure self-repairing method and system
CN115794499B (en) Method and system for dual-activity replication data among distributed block storage clusters
CN111813348A (en) Device, method, device and medium for node event processing in unified storage device
CN115357395A (en) Faulty equipment task transfer method and system, electronic equipment and storage medium
CN118689710A (en) A disaster recovery switching device for automated operation and maintenance system
CN117827544B (en) Hot backup system, method, electronic device and storage medium
CN114337944A (en) A system-level active-standby redundancy general control method
CN119473720A (en) Data backup method, device, electronic device and storage medium
CN119094313A (en) Data processing method, device, electronic device and computer readable storage medium
JP3447347B2 (en) Failure detection method
CN117951115A (en) Data migration method, device, equipment and medium based on Kafka cluster
CN115065589A (en) Data traffic acquisition disaster recovery backup processing method, device, equipment, system and medium
CN111831489B (en) MySQL (mySQL structured query language) fault switching method and device based on sentinel mechanism
CN115934742A (en) Fault processing method, device, equipment and storage medium
CN111581221B (en) Method for redundant storage and reconstruction of information of distributed multi-station fusion system
CN112148214B (en) A node information processing method, device and medium in a dual-control environment
CN113806126B (en) A cloud application continuation calculation method and system for dealing with sudden failures
CN118784458B (en) Fault processing method and related device based on large model cluster training
CN101179810A (en) HLR order dispatching device based on fast writing of CORBA interface
CN115190005B (en) Redis-based high availability method of double-host system
CN117539687A (en) Off-site disaster recovery methods, devices, equipment and media based on Nacos
CN120378287A (en) Cross-machine-room distributed storage cluster fault recovery method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant