CN105515812A - Fault processing method of resources and device - Google Patents
Fault processing method of resources and device
- Publication number
- CN105515812A
- Authority
- CN
- China
- Prior art keywords
- resource
- allocated resource
- node
- service
- unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
- H04L41/0663—Performing the actions predefined by failover planning, e.g. switching to standby network elements
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Hardware Redundancy (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
Description
Technical field
The present invention relates to the field of communications, and in particular to a method and device for handling resource faults.
Background
Network-attached storage (NAS) systems are widely used in enterprise management platforms. Their security and reliability directly affect an enterprise's daily operations, so a network-attached storage system must be stable and highly available.
According to statistics from Gartner, the causes of abnormal system operation fall mainly into the following categories: application problems (40%), operational problems (40%), operating system failures (10%), and hardware failures (10%). For a network-attached storage cluster system, the cause is often an abnormality in the software or hardware resources of a single front-end access network port or a single back-end storage resource. In this scenario, every module on the node other than the abnormal one is still running normally. The existing technical solution nevertheless isolates the entire node and transfers its services to other nodes that are running normally. This makes the whole takeover process complex and correspondingly more error-prone, makes the takeover slow, and increases the load on the takeover node once the takeover succeeds, which puts pressure on the entire storage service.
In addition, in current network storage clusters the fault management module mainly manages the storage resources on its own node, and abnormalities of the module itself are handled by re-electing among the nodes to produce a new takeover node. The best-known election algorithm is Paxos, which is used in many open-source projects. However, it performs single-instance election over basic node objects and cannot handle election among multiple specific object resources within a node.
In the related art, resource faults on a node are in many cases only partial faults, yet the node is still isolated and its services are transferred to another takeover node, which makes the takeover process complex and error-prone and increases the load on the takeover node. No effective solution to this problem has yet been proposed.
Summary of the invention
To solve the above technical problem, the present invention provides a method and device for handling resource faults.
According to one aspect of the present invention, a resource fault handling method is provided, comprising: monitoring whether a specified resource of a node in a network storage cluster system has failed, wherein the specified resource is a resource corresponding to a specified resource type among the resource types into which the network storage cluster system has been divided in advance; and, when the specified resource fails, selecting a target object to take over the specified resource according to a preset policy.
Preferably, monitoring whether a specified resource of a node in the network storage cluster system has failed comprises: dividing the resources of all nodes in the network storage cluster system by resource type; configuring the resources of the same resource type across all the nodes as one service group; and determining whether the specified resource has failed by detecting the state of the specified resource in the service group.
Preferably, the specified resource is determined to have failed in the following case: the state of the physical network port of the specified resource changes from the running state to the standby state.
Preferably, selecting a target object to take over the specified resource according to a preset policy comprises: selecting, within the service group to which the specified resource belongs, a service unit to take over the specified resource; and taking the node on which that service unit resides as the target object.
Preferably, the service unit that takes over the specified resource is selected within the service group in one of the following ways: selecting the service unit from the service group according to a preset priority; or selecting the service unit according to the IP address values of the service units in the service group.
Preferably, after the target object takes over the failed specified resource, the method further comprises: saving switchover information for the specified fault, wherein the switchover information includes at least one of: information on the original node where the specified resource resided, and the resource type of the specified resource; and, when the original node recovers from the fault, switching the specified resource back to the original node according to the switchover information.
According to another aspect of the embodiments of the present invention, a resource fault handling device is also provided, comprising: a monitoring module, configured to monitor whether a specified resource of a node in a network storage cluster system has failed, wherein the specified resource is a resource corresponding to a specified resource type among the resource types into which the network storage cluster system has been divided in advance; and a selection module, configured to select, when the specified resource fails, a target object to take over the specified resource according to a preset policy.
Preferably, the monitoring module comprises: a division unit, configured to divide the resources of all nodes in the network storage cluster system by resource type; a configuration unit, configured to configure the resources of the same resource type across all the nodes as one service group; and a judging unit, configured to determine whether the specified resource has failed by detecting the state of the specified resource in the service group.
Preferably, the judging unit is configured to determine that the specified resource has failed when the state of the physical network port of the specified resource changes from the running state to the standby state.
Preferably, the selection module comprises: a selection unit, configured to select, within the service group to which the specified resource belongs, a service unit to take over the specified resource; and a determination unit, configured to take the node on which that service unit resides as the target object.
With the present invention, the resources on the nodes are classified so that, when a specified resource fails, only the failed resource is transferred to another node. This solves the problem in the related art that resource faults on a node are often only partial faults, yet the node is still isolated and its services are transferred to another takeover node, which makes the takeover process complex and error-prone and increases the load on the takeover node. The takeover process is simplified, the error rate is lowered, and the load on the takeover node is reduced.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their descriptions explain the invention and do not unduly limit it. In the drawings:
FIG. 1 is a flowchart of a resource fault handling method according to an embodiment of the present invention;
FIG. 2 is a structural block diagram of a resource fault handling device according to an embodiment of the present invention;
FIG. 3 is another structural block diagram of a resource fault handling device according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a resource protection group model according to a preferred embodiment of the present invention;
FIG. 5 is a flowchart of resource fault handling according to a preferred embodiment of the present invention;
FIG. 6 is a flowchart of resource switchback according to a preferred embodiment of the present invention.
Detailed description
The present invention is described in detail below with reference to the drawings and in combination with the embodiments. Note that, where they do not conflict, the embodiments in this application and the features in the embodiments may be combined with one another.
This embodiment provides a resource fault handling method. FIG. 1 is a flowchart of the method according to an embodiment of the present invention. As shown in FIG. 1, the process includes the following steps:
Step S102: monitor whether a specified resource of a node in the network storage cluster system has failed, wherein the specified resource is a resource corresponding to a specified resource type among the resource types into which the network storage cluster system has been divided in advance;
Step S104: when the specified resource fails, select a target object to take over the specified resource according to a preset policy.
Through the above steps, the resources on the nodes are classified, and when a specified resource of one of the resulting types fails, only that failed resource is transferred to another node. This solves the problem in the related art that resource faults on a node are often only partial faults, yet the node is still isolated and its services are transferred to another takeover node, which makes the takeover process complex and error-prone and increases the load on the takeover node. The takeover process is simplified, the error rate is lowered, and the load on the takeover node is reduced. In other words, under the technical solution of this embodiment the takeover node takes over only the faulty part of the resources. Because the node where the fault occurred is not isolated, the same resource must be prevented from being loaded at more than one end, so that services stay consistent and continue to be provided externally.
Optionally, step S102 can be implemented in several ways. In one example of this embodiment, it is implemented as follows: divide the resources of all nodes in the network storage cluster system by resource type; configure the resources of the same resource type across all the nodes as one service group; and determine whether the specified resource has failed by detecting the state of the specified resource in the service group. That is, resources of the same resource type on all nodes of the cluster are logically placed in one service group, and the resources in that group are checked for faults. Because one service group corresponds to one resource type, the type of a fault can be identified quickly and the resources are easier to manage.
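The grouping described in the paragraph above can be sketched in a few lines. This is a minimal illustration only: the `Resource` class, its field names, and the type labels `virtual_port` and `virtual_disk` are hypothetical and do not come from the patent text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    node: str           # node hosting the resource (hypothetical naming)
    resource_type: str  # e.g. "virtual_port" or "virtual_disk" (hypothetical labels)
    name: str


def build_service_groups(resources):
    """Place resources of the same type, across all nodes, into one service group."""
    groups = defaultdict(list)
    for res in resources:
        groups[res.resource_type].append(res)
    return dict(groups)


resources = [
    Resource("node-1", "virtual_port", "eth0"),
    Resource("node-2", "virtual_port", "eth0"),
    Resource("node-1", "virtual_disk", "vd0"),
    Resource("node-2", "virtual_disk", "vd0"),
]
service_groups = build_service_groups(resources)
```

Keying the groups by resource type means a fault reported inside a group already identifies which kind of resource failed, which is the benefit the text attributes to this arrangement.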
Every physical network port has two states: the running state (ACTIVE) and the standby state (STANDBY). When the state of the physical network port of the specified resource changes from the running state to the standby state, the specified resource can be determined to have failed.
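As a rough sketch of this rule, the check below compares two snapshots of port states and reports only the ACTIVE to STANDBY transition as a fault. The snapshot dictionaries and the port naming scheme are assumptions made for illustration; the patent does not specify how states are sampled.

```python
from enum import Enum


class PortState(Enum):
    ACTIVE = "ACTIVE"
    STANDBY = "STANDBY"


def is_fault(previous: PortState, current: PortState) -> bool:
    """Only the ACTIVE -> STANDBY transition counts as a fault."""
    return previous is PortState.ACTIVE and current is PortState.STANDBY


def detect_faults(previous_states, current_states):
    """Return the ports whose state went from ACTIVE to STANDBY between two snapshots."""
    return [
        port
        for port, old in previous_states.items()
        if port in current_states and is_fault(old, current_states[port])
    ]


before = {"node-1/eth0": PortState.ACTIVE, "node-2/eth0": PortState.STANDBY}
after = {"node-1/eth0": PortState.STANDBY, "node-2/eth0": PortState.STANDBY}
failed_ports = detect_faults(before, after)
```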
In another optional embodiment, step S104 can be implemented as follows: select, within the service group to which the specified resource belongs, a service unit to take over the specified resource, and take the node on which that service unit resides as the target object. While the resources in a service group are being monitored and a fault of the specified resource is detected, the same service group is searched for a service unit that corresponds to a resource of the same type as the failed resource. Once the service unit is determined, the node on which it resides is the target object (which can also be understood as the takeover node).
To keep node services in the system consistent after the target object has taken over the specified resource, this embodiment further provides the following: after the target object takes over the specified resource, save the switchover information for the specified fault, wherein the switchover information includes at least one of: information on the original node where the specified resource resided, and the resource type of the specified resource; and, when the original node recovers from the fault, switch the specified resource back to the original node according to the switchover information.
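A minimal sketch of how such switchover information might be kept and later used for switch-back is given below. The `SwitchRecord` and `SwitchLog` names, the in-memory dictionary, and the `placement` mapping (resource name to current node) are all hypothetical; the patent only states which information is saved, not how.

```python
from dataclasses import dataclass


@dataclass
class SwitchRecord:
    resource: str
    resource_type: str
    original_node: str
    takeover_node: str


class SwitchLog:
    """Keeps one record per resource that is currently running away from its original node."""

    def __init__(self):
        self._records = {}

    def record_takeover(self, resource, resource_type, original_node, takeover_node):
        self._records[resource] = SwitchRecord(
            resource, resource_type, original_node, takeover_node
        )

    def switch_back(self, recovered_node, placement):
        """Move every resource recorded against recovered_node back to it."""
        restored = []
        for name, rec in list(self._records.items()):
            if rec.original_node == recovered_node:
                placement[name] = rec.original_node
                del self._records[name]
                restored.append(name)
        return restored


placement = {"vport-1": "node-2"}  # vport-1 has already been taken over by node-2
log = SwitchLog()
log.record_takeover("vport-1", "virtual_port", "node-1", "node-2")
```

Deleting the record on switch-back keeps the log limited to resources that are still displaced, so a repeated recovery notification for the same node is a no-op.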
In summary, the embodiments of the present invention provide a high-availability mechanism for network-attached storage clusters that addresses partial-failure problems in current network-attached storage, such as data loss on running nodes, high network load, and multi-end loading of resources.
For a better understanding of the resource fault handling process, it is described below with reference to a preferred embodiment, which does not limit the embodiments of the present invention.
First, the terms used in the preferred embodiment are briefly explained as follows:
Service instance: the basic unit of a protected resource (which can be understood as a resource in the service group described above). In a network-attached storage cluster it corresponds to the set of virtual network port and virtual disk objects. Taking the virtual network port as an example, a virtual network port is an abstraction over the aggregation of several physical network ports that currently provide network connectivity, and it is unique across the whole cluster. The virtual network port is bound to a physical network port in the ACTIVE state, and that physical port carries all services on the externally visible virtual port. When the ACTIVE physical network port becomes abnormal, a target object is elected from the set of STANDBY protected resources according to the configuration policy to take over, so that the external services of the virtual network port are not interrupted.
Service unit: a fully functional individual deployed on each node of the cluster that can accept service instance assignments. Each node in the storage cluster system contains a service unit made up of two service instances, a front-end network port and a back-end virtual disk object. Assuming the network-attached storage cluster system has N nodes, one service unit can bear at most N ACTIVE service instance assignments and N STANDBY service instance assignments.
Service group: the set of objects of the same resource type across one or more service units; conversely, the specific objects drawn from several service groups make up a service unit. Taking the virtual network port as an example, all the physical network ports that carry the virtual port's services form the virtual port's service group. Each service group has its own active/standby policy, and service groups are completely independent of one another. Each service group has a unique identifier, which is specified at creation and is unique within the network-attached storage cluster system.
Home node: specified when a front-end or back-end virtual storage resource is created. A virtual resource can belong to only one node, and at power-on the service unit object on the home node is preferred for the ACTIVE service instance assignment.
Configuration policy: specified when a front-end or back-end virtual resource is created. When the resource becomes abnormal, a service unit object is selected to take over according to this policy. By default, the service unit whose IP address has the smaller value takes over first. An interface is also provided for manual intervention: different weights can be configured for the service unit objects, and the unit with the larger weight takes over the failed resource first.
Takeover node (which can be understood as the target object in the above embodiment): when the ACTIVE service unit of a front-end or back-end resource becomes abnormal, an election is initiated among the STANDBY nodes according to the configuration policy and produces a new ACTIVE service unit object. The node on which that service unit object resides is called the takeover node.
Master decision node: the node hosting the ACTIVE service instance elected when the fault management module is powered on. When the fault management module itself becomes abnormal, a new election is initiated, producing a new ACTIVE fault management service instance assignment. The node hosting the new service instance is the new master decision node.
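The configuration policy defined above (smaller IP address value first by default, larger manually configured weight first when weights are set) can be sketched as a single selection function. How the two rules combine is not spelled out in the text, so this sketch assumes that weight takes precedence, that an unset weight counts as 0, and that ties fall back to the smaller IP value. The `ServiceUnit` class and its fields are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceUnit:
    node_ip: str
    state: str = "STANDBY"
    weight: Optional[int] = None  # None: no manual weight configured (assumed to mean 0)


def select_takeover_unit(units):
    """Pick the STANDBY unit with the largest weight, then the smallest IP value."""
    candidates = [u for u in units if u.state == "STANDBY"]
    if not candidates:
        return None

    def sort_key(unit):
        weight = unit.weight if unit.weight is not None else 0
        # Negate the weight so that min() prefers larger weights.
        return (-weight, int(ipaddress.ip_address(unit.node_ip)))

    return min(candidates, key=sort_key)


units = [
    ServiceUnit("10.0.0.10"),
    ServiceUnit("10.0.0.9"),
    ServiceUnit("10.0.0.1", state="ACTIVE"),  # the failed ACTIVE unit is not a candidate
]
default_choice = select_takeover_unit(units)
```

Comparing addresses through `ipaddress.ip_address` rather than as strings matters here: as plain text, "10.0.0.10" would sort before "10.0.0.9".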
The technical solution of the preferred embodiment can be summarized as follows: by defining a protected resource model and a fault management framework, the front-end network resources and back-end storage resources of the network-attached storage are managed so that the resources of the whole storage cluster are highly available.
Heartbeat monitoring watches the protected resources for partial abnormalities in front-end and back-end resources. Once the monitoring module detects an abnormality, it raises an alarm to the fault management module. On receiving the alarm, the fault management module decides which resources need to be taken over according to the takeover priority of the protected resources and performs the takeover, ensuring continuity of external services. It also records the switchover information for the abnormal resource.
Optionally, once the fault is cleared, the state of the failed module is automatically synchronized to the protected resource group. The monitoring module detects the recovery and sends a fault recovery request to the fault management module, which performs the corresponding switchback according to the switchover information of the abnormal resource.
In the technical solution above, the resource protection group model can be described roughly as follows. Each node hosts a resident monitoring module that is responsible for heartbeat monitoring and, when an abnormality occurs, for election within the service group according to the configuration policy. The module resides on each node as a daemon thread. The node powered on first is the master decision node; if several nodes are powered on at the same time, their IP addresses are compared and the node with the smaller IP address value is elected as the master decision node. Nodes communicate through Remote Procedure Call (RPC) protocol messages. Normally the master decision node initiates heartbeat checks and collects service unit state information from the other nodes by service group identifier. The other nodes decide whether to send a new beacon and start a new election based on at least one of the following events: 1. the periodic heartbeat check has exceeded the maximum check time; 2. the service unit currently in the ACTIVE state is abnormal. When either condition is met, a beacon is sent to all stations in the cluster to initiate election of the ACTIVE service unit.
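The master election rule and the two re-election triggers described above can be sketched as follows. This is not Paxos and not the patent's full protocol: there is no RPC or beacon exchange here, only the decision logic, and the `Node` class with its `power_on_time` field is a hypothetical stand-in for however nodes learn each other's start order.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    ip: str
    power_on_time: float  # seconds since some shared reference (hypothetical field)


def elect_master(nodes):
    """Earliest power-on wins; equal times are resolved by the smaller IP value."""
    return min(
        nodes,
        key=lambda n: (n.power_on_time, int(ipaddress.ip_address(n.ip))),
    )


def needs_election(last_heartbeat, now, max_interval, active_unit_healthy):
    """True if the heartbeat check is overdue or the ACTIVE service unit is abnormal."""
    heartbeat_overdue = (now - last_heartbeat) > max_interval
    return heartbeat_overdue or not active_unit_healthy


nodes = [
    Node("10.0.0.3", power_on_time=5.0),
    Node("10.0.0.2", power_on_time=1.0),
    Node("10.0.0.1", power_on_time=1.0),
]
master = elect_master(nodes)
```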
The fault management module on the master decision node, which is elected by fault management service identifier, manages all front-end and back-end resources of the storage system. Within these resources, the ACTIVE service unit performs the work of a service instance and all services are carried on that service instance, while the other service units are in the STANDBY state for that instance. After an abnormality of the ACTIVE service unit is detected, the fault management module coordinates the whole takeover through the following process:
Step 1: configure the virtual network port and virtual disk shared storage service groups on each node. The front-end virtual network service group is used for user access to the storage network, and the back-end virtual disk storage service group is used to hold shared storage data resources;
Step 2: assign a home node to every virtual resource and register the configured resources with the resource service units. Under normal conditions a virtual resource actually runs in the service unit on its home node, and that service unit is in the ACTIVE state;
Step 3: the monitoring module performs real-time heartbeat monitoring of the resources in all resource protection groups and raises an alarm as soon as a running resource in a protection group becomes abnormal;
Step 4: the fault management module receives the alarm and takes offline the service unit resource in the service group that is currently running abnormally;
Step 5: based on the current node and the service group identifier, select the target takeover service unit object according to the configuration policy, migrate to it, save a record of the migration, and set the new service unit to the ACTIVE state;
Step 6: when the abnormal front-end or back-end resource returns to normal, the resource service group is updated automatically and the fault management module is notified;
Step 7: the fault management module switches the running resources back according to the migration record saved at the time of the abnormality. The fault is thereby recovered, and the states of the two service unit objects are adjusted at the same time.
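The state changes in steps 2, 4, 5, and 7 can be traced with a small model of one service group. This sketch covers only the bookkeeping of unit states; the takeover node is passed in by the caller rather than elected, and the `OFFLINE` label for a unit that has been taken offline is an assumption, since the text names only the ACTIVE and STANDBY states.

```python
class ServiceGroup:
    """Tracks the state of each node's service unit within one service group."""

    def __init__(self, group_id, home_node, standby_nodes):
        self.group_id = group_id
        # Step 2: the unit on the home node starts ACTIVE, all others STANDBY.
        self.states = {home_node: "ACTIVE"}
        self.states.update({node: "STANDBY" for node in standby_nodes})
        self.migration = None  # (failed_node, takeover_node) while a takeover is in effect

    def active_node(self):
        return next(node for node, state in self.states.items() if state == "ACTIVE")

    def on_alarm(self, failed_node, takeover_node):
        # Step 4: take the abnormal unit offline ("OFFLINE" is an assumed label).
        self.states[failed_node] = "OFFLINE"
        # Step 5: activate the takeover unit and save the migration record.
        self.states[takeover_node] = "ACTIVE"
        self.migration = (failed_node, takeover_node)

    def on_recovery(self, recovered_node):
        # Steps 6-7: switch back only if a migration away from this node is recorded.
        if self.migration is None or self.migration[0] != recovered_node:
            return False
        _, takeover_node = self.migration
        # Both unit states are adjusted together, as step 7 describes.
        self.states[recovered_node] = "ACTIVE"
        self.states[takeover_node] = "STANDBY"
        self.migration = None
        return True


group = ServiceGroup("vport-group", home_node="node-1", standby_nodes=["node-2", "node-3"])
group.on_alarm(failed_node="node-1", takeover_node="node-2")
```

After the alarm, `group.states` shows node-1 offline and node-2 active; calling `group.on_recovery("node-1")` restores the original arrangement.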
本发明优选实施例达到了以下技术效果:通过资源保护组模型,将集群节点按前端网络资源、后端存储资源进行细化,节点部分资源异常场景下,支持只接管节点异常部分,保留节点正常运行部分。从而提高了整体性能,实现网络附属存储群集资源的有效利用;满足关键业务高可用性、稳定性和扩展性的要求,可用于高可用存储集群多机热备要求的故障检测、接管决策、故障隔离与切换、恢复与扩展;通过对Paxos算法进行改进,按节点和服务组标识支持多实例选举,提高选举灵活性,故障管理模块本身加入保护资源组进行热备,简化系统实现,有效解决主决策节点上故障管理模块本身异常问题;在集群系统内部署热备主机,充分利用主机自身运算能力,提升接管响应速度,降低成本开支。The preferred embodiment of the present invention achieves the following technical effects: through the resource protection group model, the cluster nodes are refined according to the front-end network resources and the back-end storage resources. In the case of abnormal node resources, it is supported to only take over the abnormal part of the node and keep the normal node. run part. Thereby improving the overall performance and realizing the effective utilization of network-attached storage cluster resources; meeting the requirements of high availability, stability and scalability of key businesses, and can be used for fault detection, takeover decision-making, and fault isolation required by multi-machine hot backup of high-availability storage clusters and switching, recovery and expansion; by improving the Paxos algorithm, multi-instance elections are supported according to node and service group identifiers, which improves the flexibility of elections, and the fault management module itself is added to the protection resource group for hot backup, which simplifies system implementation and effectively solves the main decision The fault management module itself on the node is abnormal; deploy a hot standby host in the cluster system to make full use of the host's own computing power, improve the takeover response speed, and reduce costs.
This embodiment also provides a resource fault processing device. The device implements the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" refers to a combination of software and/or hardware that can realize a predetermined function. Although the device described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 2 is a structural block diagram of a resource fault processing device according to an embodiment of the present invention. As shown in Fig. 2, the device includes:
a monitoring module 20, configured to monitor whether a specified resource of a node in the network storage cluster system has failed, where the specified resource is the resource corresponding to a specified resource type among the resource types pre-divided in the network storage cluster system;
a selection module 22, connected to the monitoring module 20 and configured to select, according to a preset policy, a target object to take over the specified resource when the specified resource fails.
Through the combined action of the above modules, the resources on a node are classified, and when a specified resource of one of the classified types fails, only the failed resource is transferred to another node. This solves a problem in the related art: in many cases a resource failure on a node is only a partial failure, yet the whole node is still isolated and all of its services are transferred to a takeover node, which makes the takeover process complicated and error-prone and increases the load on the takeover node. The present solution simplifies the takeover process, lowers the error rate, and reduces the load on the takeover node.
Fig. 3 is another structural block diagram of the resource fault processing device according to an embodiment of the present invention. As shown in Fig. 3, in addition to all the modules shown in Fig. 2, the device further includes:
To implement the function of monitoring whether a specified resource of a node in the network storage cluster system has failed, in an optional embodiment the monitoring module 20 may include the following units: a dividing unit 200, configured to divide the resources of all nodes in the network storage cluster system by resource type; a configuration unit 202, connected to the dividing unit 200 and configured to configure the resources of the same resource type across all nodes as one service group; and a judgment unit 204, connected to the configuration unit 202 and configured to judge whether the specified resource has failed by detecting the state of the specified resource in the service group. The judgment unit 204 determines that the specified resource has failed when the state of its physical network port changes from the running state to the standby state.
Optionally, the selection module 22 may further include the following units: a selection unit 220, configured to select, within the service group to which the specified resource belongs, a service unit to take over the specified resource; and a determination unit 222, connected to the selection unit 220 and configured to use the node on which that service unit is located as the target object.
In the embodiments of the present invention, the target object in the selection module 22 can be understood as the takeover node of the above embodiments.
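The division of labour among the dividing, configuration, judgment, selection, and determination units can be illustrated with a minimal Python sketch. The class name `ServiceUnit`, the `healthy` flag, the resource type names, and the selection policy (first healthy standby unit on another node, ordered by node name) are assumptions made for illustration; the source only says that a preset policy is used.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ServiceUnit:
    node: str              # node hosting this unit
    resource_type: str     # e.g. "network_port" or "disk" (names are assumptions)
    state: str             # "ACTIVE" (running) or "STANDBY"
    healthy: bool = True   # hypothetical flag, not part of the source


def build_service_groups(units):
    """Dividing and configuration units: one service group per resource type."""
    groups = defaultdict(list)
    for unit in units:
        groups[unit.resource_type].append(unit)
    return dict(groups)


def is_fault(previous_state, current_state):
    """Judgment unit: a running -> standby transition signals a fault."""
    return previous_state == "ACTIVE" and current_state == "STANDBY"


def select_takeover_node(groups, failed_unit):
    """Selection and determination units: pick a unit in the same service
    group and return the node it is located on.

    Assumed policy: first healthy standby unit on another node, by node name.
    """
    candidates = sorted(
        (u for u in groups[failed_unit.resource_type]
         if u.node != failed_unit.node and u.healthy and u.state == "STANDBY"),
        key=lambda u: u.node,
    )
    return candidates[0].node if candidates else None


units = [
    ServiceUnit("node1", "network_port", "ACTIVE"),
    ServiceUnit("node2", "network_port", "STANDBY"),
    ServiceUnit("node3", "network_port", "STANDBY"),
    ServiceUnit("node1", "disk", "STANDBY"),
    ServiceUnit("node2", "disk", "ACTIVE"),
]
groups = build_service_groups(units)

failed = units[0]
target = None
if is_fault(failed.state, "STANDBY"):
    # Record the failure, then choose the takeover node for this resource only.
    failed.state, failed.healthy = "STANDBY", False
    target = select_takeover_node(groups, failed)
```

Only the network port resource of `node1` is handed over here; the disk service group is left untouched, which mirrors the partial takeover the device is meant to achieve.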
The technical solutions of the embodiments of the present invention are described in further detail with reference to the following preferred embodiments:
Fig. 4 is a schematic diagram of the resource protection group model according to a preferred embodiment of the present invention. As shown in Fig. 4, there are two service groups, the virtual network port service group and the virtual disk service group, and two service instances, the virtual network port service instance and the virtual disk service instance. The virtual network port service instance is protected and executed by the virtual network port service group, and the virtual disk service instance is protected and executed by the virtual disk service group. A solid arrow points to the ACTIVE service unit object, which actually carries the service; a dashed arrow points to a STANDBY service unit object, from which a new ACTIVE unit is assigned to take over when an abnormality occurs.
As Fig. 4 shows, within the virtual network port service group, service unit 3 performs the ACTIVE work of the virtual network port service instance, while service unit 1 and service unit 2 perform its STANDBY work. In Fig. 4, a solid line connecting the virtual network port service instance or the virtual disk service instance to a service unit denotes an ACTIVE assignment, and a dashed line denotes a STANDBY assignment.
Within the virtual disk service group, service unit 2 performs the ACTIVE work of the virtual disk service instance, while service unit 1 and service unit 3 perform its STANDBY work.
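The assignments described for Fig. 4 can be written down as a small data structure. The sketch below is only an illustration: the dictionary layout, the function name `promote_standby`, and the promotion policy (lowest-numbered standby unit becomes ACTIVE, the failed unit is demoted to standby) are assumptions, while the unit numbers follow the description of Fig. 4.

```python
# Service unit numbers follow Fig. 4; the dictionary layout is an assumption.
protection_groups = {
    "virtual_network_port": {"active": 3, "standby": [1, 2]},
    "virtual_disk": {"active": 2, "standby": [1, 3]},
}


def promote_standby(groups, group_name):
    """Assign a new ACTIVE unit from the STANDBY units of one service group.

    Assumed policy: the lowest-numbered standby unit takes over and the
    failed unit is demoted to standby. Other service groups are not touched.
    """
    group = groups[group_name]
    if not group["standby"]:
        raise RuntimeError(f"no standby unit left in group {group_name!r}")
    failed_unit = group["active"]
    new_active = min(group["standby"])
    group["standby"].remove(new_active)
    group["standby"].append(failed_unit)
    group["active"] = new_active
    return new_active


# Service unit 3 loses its network port; only that service group is reassigned.
new_active = promote_standby(protection_groups, "virtual_network_port")
```

Because each service group keeps its own ACTIVE and STANDBY assignment, a failure in the virtual network port group leaves the virtual disk group exactly as it was.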
Fig. 5 is a flowchart of resource fault handling according to a preferred embodiment of the present invention. As shown in Fig. 5:
In the scenario where some of a node's resources are abnormal, the complete takeover flow triggered by a resource fault is as follows:
Step S502: The state of a service protection resource on the resource's home node changes from ACTIVE to STANDBY (triggered by a device fault or a man-machine command), and the monitoring agent module on that node is notified;
Step S504: The monitoring module of the master decision node communicates with the monitoring agent of each node through a periodic heartbeat, detects that the state of a protection resource of the corresponding type is abnormal, and sends a switch request to the fault management module on its own node;
Step S506: The fault management module notifies the agent module of the home node of the abnormal resource to take the affected resource offline; the agent module performs the resource offline operation, cleans up the resource, and then returns a resource offline response to the fault management module of the master decision node;
Step S508: The fault management module of the master decision node receives the resource offline response, elects a takeover node for the abnormal resource according to the configured policy, and sends a resource online request to the agent module of the takeover node;
Step S510: The agent module of the target node receives the resource online request, performs the resource online operation on the service module, and then returns a resource online response to the fault management module of the master decision node;
Step S512: The fault management module of the master decision node receives the resource online response, considers the switchover complete, and returns a switch response to the monitoring module on its own node. The flow ends.
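Steps S502 to S512 can be traced with a minimal in-process Python sketch. It replaces the real heartbeat and messaging with plain function calls; the class names `NodeAgent` and `FaultManager`, the response strings, the resource name `vnic0`, and the election policy (alphabetically first other node) are all hypothetical. The point it illustrates is the ordering: the resource goes offline on its home node before it is brought online on the takeover node.

```python
class NodeAgent:
    """Per-node agent module that takes resources offline or brings them online."""

    def __init__(self, name):
        self.name = name
        self.online = set()

    def take_offline(self, resource):
        self.online.discard(resource)   # offline operation plus cleanup
        return "offline_response"

    def bring_online(self, resource):
        self.online.add(resource)
        return "online_response"


def detect_faults(previous, reports):
    """S504: compare heartbeat reports with the last known states."""
    faults = []
    for node, resources in reports.items():
        for resource, state in resources.items():
            was_active = previous.get(node, {}).get(resource) == "ACTIVE"
            if was_active and state == "STANDBY":
                faults.append((node, resource))
    return faults


class FaultManager:
    """Fault management module on the master decision node."""

    def __init__(self, agents, policy):
        self.agents = agents            # node name -> NodeAgent
        self.policy = policy            # callable(resource, home) -> node name
        self.migration_record = {}      # resource -> (home, takeover)

    def handle_switch_request(self, resource, home):
        # S506: offline on the home node first, so the resource is never
        # loaded on two nodes at the same time.
        if self.agents[home].take_offline(resource) != "offline_response":
            raise RuntimeError("resource could not be taken offline")
        # S508: elect the takeover node according to the configured policy.
        target = self.policy(resource, home)
        # S510: bring the resource online on the takeover node.
        if self.agents[target].bring_online(resource) != "online_response":
            raise RuntimeError("resource could not be brought online")
        # S512: remember where the resource came from, then report completion.
        self.migration_record[resource] = (home, target)
        return "switch_response"


agents = {name: NodeAgent(name) for name in ("node1", "node2", "node3")}
agents["node1"].bring_online("vnic0")

previous = {"node1": {"vnic0": "ACTIVE"}}
reports = {"node1": {"vnic0": "STANDBY"}}   # S502: state change reported by node1

manager = FaultManager(
    agents,
    policy=lambda resource, home: min(n for n in agents if n != home),
)
results = [
    manager.handle_switch_request(resource, node)
    for node, resource in detect_faults(previous, reports)
]
```

Keeping a migration record at the end of the takeover is what later allows the resource to be returned to its original node once the fault has been repaired.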
Fig. 6 is a flowchart of resource switchback according to a preferred embodiment of the present invention. As shown in Fig. 6:
In the scenario where the abnormal resources of a node have recovered, the complete switchback flow triggered by the resource fault recovery is as follows:
Step S602: The state of a service protection resource on the resource's home node changes from STANDBY to ACTIVE (triggered by recovery from a device fault or by a man-machine command), and the monitoring agent module on that node is notified;
Step S604: The monitoring module of the master decision node communicates with the monitoring agent of each node through a periodic heartbeat, detects that the state of an active protection resource of the corresponding type has recovered, and sends a switch request to the fault management module on its own node;
Step S606: The fault management module notifies the agent module of the takeover node to take the resource offline; after cleaning up the resource, the agent module returns a resource offline response to the fault management module of the master decision node;
Step S608: The fault management module of the master decision node receives the resource offline response and sends a resource online request to the agent module of the original home node;
Step S610: The agent module of the resource's home node receives the resource online request, performs the resource online operation on the service module, and then returns a resource online response to the fault management module of the master decision node;
Step S612: The fault management module of the master decision node receives the resource online response, considers the switchover complete, and returns a switchback response to the monitoring module on its own node. The flow ends.
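The switchback flow can be sketched in Python in the same spirit. The sketch assumes that a migration record of the form resource -> (home node, takeover node) was written when the resource was taken over; the function name `switch_back`, the dictionary shapes, and the resource name `vnic0` are hypothetical.

```python
def switch_back(resource, migration_record, online):
    """Return a recovered resource to its original home node.

    migration_record: resource -> (home_node, takeover_node), assumed to have
        been written at takeover time.
    online: node name -> set of resources currently online on that node.
    """
    home, takeover = migration_record[resource]
    # S606: the takeover node takes the resource offline and cleans it up.
    online[takeover].discard(resource)
    # S608 to S610: the original home node brings the resource online again.
    online[home].add(resource)
    # S612: the switchback is complete, so the record is no longer needed.
    del migration_record[resource]
    return home


migration_record = {"vnic0": ("node1", "node2")}
online = {"node1": set(), "node2": {"vnic0"}}

restored_to = switch_back("vnic0", migration_record, online)
```

As in the takeover direction, the offline step precedes the online step, so the resource is never loaded on the takeover node and the home node at the same time.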
From the description of the above implementations, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation.
In another embodiment, software is also provided, which is used to execute the technical solutions described in the above embodiments and preferred implementations.
In another embodiment, a storage medium is also provided, in which the above software is stored. The storage medium includes, but is not limited to, an optical disc, a floppy disk, a hard disk, and rewritable memory.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that objects so named are interchangeable under appropriate circumstances, so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having", and any variations of them, are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
In summary, the embodiments of the present invention achieve the following technical effects: the takeover process is simplified, the error rate is lowered, and the load on the takeover node is reduced. In other words, with the technical solution of the embodiments, the takeover node takes over only the faulty part of the resources. Because the node on which the fault occurred is not isolated, the resource must be prevented from being loaded on more than one node at once, so that service consistency is preserved and service continues to be provided externally.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device. In some cases, the steps shown or described can be performed in an order different from the one given here, or the modules or steps can each be made into a separate integrated circuit module, or several of them can be made into a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410545516.4A CN105515812A (en) | 2014-10-15 | 2014-10-15 | Fault processing method of resources and device |
PCT/CN2015/072923 WO2016058307A1 (en) | 2014-10-15 | 2015-02-12 | Fault handling method and apparatus for resource |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410545516.4A CN105515812A (en) | 2014-10-15 | 2014-10-15 | Fault processing method of resources and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105515812A true CN105515812A (en) | 2016-04-20 |
Family
ID=55723475
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410545516.4A Pending CN105515812A (en) | 2014-10-15 | 2014-10-15 | Fault processing method of resources and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN105515812A (en) |
WO (1) | WO2016058307A1 (en) |
-
2014
- 2014-10-15 CN CN201410545516.4A patent/CN105515812A/en active Pending
-
2015
- 2015-02-12 WO PCT/CN2015/072923 patent/WO2016058307A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2016058307A1 (en) | 2016-04-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20160420 |