CN105515812A - Fault processing method of resources and device - Google Patents

Publication number
CN105515812A
Authority
CN
China
Prior art keywords
resource
allocated resource
node
service
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410545516.4A
Other languages
Chinese (zh)
Inventor
陈重文
宋亚东
谢型果
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp
Priority to CN201410545516.4A
Priority to PCT/CN2015/072923 (WO2016058307A1)
Publication of CN105515812A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0654: Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663: Performing the actions predefined by failover planning, e.g. switching to standby network elements

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Hardware Redundancy (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a resource fault handling method and device. The method comprises: monitoring whether a specified resource of a node in a network storage cluster system has failed, where the specified resource is a resource corresponding to a specified resource type among the resource types pre-divided in the network storage cluster system; and, when the specified resource fails, selecting a target object to take over the specified resource according to a preset policy. This addresses a problem in the prior art: even when a resource fault on a node is only a partial fault, the node is still isolated and its services are transferred to other takeover nodes, which makes the takeover procedure complex and error-prone and increases the load on the takeover node. The invention simplifies the takeover procedure, lowers the error rate, and reduces the load on the takeover node.

Description

Resource fault handling method and device

Technical field

The present invention relates to the field of communications, and in particular to a resource fault handling method and device.

Background

Network-attached storage systems are widely used in enterprise management platforms, and their security and reliability directly affect an enterprise's daily operations. A network-attached storage system therefore needs to be stable and highly available.

According to statistics from Gartner, the causes of abnormal system operation fall into four main categories: application problems (40%), operational problems (40%), operating system failures (10%), and hardware failures (10%). In a network-attached storage cluster system, the fault is often confined to the software or hardware of a single front-end access network port or a single back-end storage resource. In that scenario, every module on the node other than the faulty one still runs normally. The existing approach is nevertheless to isolate the whole node and move its services to other nodes that are running normally. This makes the takeover process complex, raises the probability of errors, and lengthens the takeover. After a successful takeover, the load on the takeover node also rises, which puts pressure on the entire storage service.

In addition, in current network storage clusters the fault management module mainly manages the storage resources on its own node, and a fault in the module itself is handled by re-electing a node to produce a new takeover node. The best-known election algorithm is Paxos, which is used in many open source projects, but it is essentially a single-instance election over node objects and cannot handle the election of multiple specific object resources within a node.

In the related art, resource faults on a node are in many cases only partial faults, yet the node is still isolated and its services are moved to other takeover nodes. This makes the takeover process complex and error-prone and increases the load on the takeover node. No effective solution to this problem has yet been proposed.

Summary of the invention

To solve the above technical problem, the present invention provides a resource fault handling method and device.

According to one aspect of the present invention, a resource fault handling method is provided, comprising: monitoring whether a specified resource of a node in a network storage cluster system has failed, where the specified resource is a resource corresponding to a specified resource type among the resource types pre-divided in the network storage cluster system; and, when the specified resource fails, selecting a target object to take over the specified resource according to a preset policy.

Preferably, monitoring whether a specified resource of a node in the network storage cluster system has failed comprises: dividing the resources of all nodes in the network storage cluster system by resource type; configuring the resources of the same resource type on all the nodes as one service group; and determining whether the specified resource has failed by checking the status of the specified resource in the service group.

Preferably, the specified resource is determined to have failed when the state of its physical network port changes from the running state to the standby state.

Preferably, selecting a target object to take over the specified resource according to a preset policy comprises: selecting, in the service group to which the specified resource belongs, a service unit to take over the specified resource; and taking the node on which the service unit resides as the target object.

Preferably, the service unit that takes over the specified resource is selected in the service group to which the resource belongs in one of the following ways: selecting the service unit from the service group according to a preset priority; or selecting the service unit according to the IP address values of the service units in the service group.

Preferably, after the target object takes over the failed specified resource, the method further comprises: saving switchover information for the specified fault, where the switchover information includes at least one of the following: information about the original node on which the specified resource resided, and the resource type of the specified resource; and, when the original node recovers from the fault, switching the specified resource back to the original node according to the switchover information.

According to another aspect of the embodiments of the present invention, a resource fault handling device is also provided, comprising: a monitoring module, configured to monitor whether a specified resource of a node in a network storage cluster system has failed, where the specified resource is a resource corresponding to a specified resource type among the resource types pre-divided in the network storage cluster system; and a selection module, configured to select, when the specified resource fails, a target object to take over the specified resource according to a preset policy.

Preferably, the monitoring module comprises: a division unit, configured to divide the resources of all nodes in the network storage cluster system by resource type; a configuration unit, configured to configure the resources of the same resource type on all the nodes as one service group; and a judging unit, configured to determine whether the specified resource has failed by checking the status of the specified resource in the service group.

Preferably, the judging unit is configured to determine that the specified resource has failed when the state of its physical network port changes from the running state to the standby state.

Preferably, the selection module comprises: a selection unit, configured to select, in the service group to which the specified resource belongs, a service unit to take over the specified resource; and a determination unit, configured to take the node on which the service unit resides as the target object.

With the present invention, the resources on a node are classified, and when a specified resource fails, only the failed resource is moved to another node. This solves the problem in the related art that, although resource faults on a node are in many cases only partial faults, the node is still isolated and its services are moved to other takeover nodes, which makes the takeover process complex and error-prone and increases the load on the takeover node. The invention simplifies the takeover process, lowers the error rate, and reduces the load on the takeover node.

Brief description of the drawings

The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their descriptions explain the invention and do not unduly limit it. In the drawings:

FIG. 1 is a flowchart of a resource fault handling method according to an embodiment of the present invention;

FIG. 2 is a structural block diagram of a resource fault handling device according to an embodiment of the present invention;

FIG. 3 is another structural block diagram of a resource fault handling device according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a resource protection group model according to a preferred embodiment of the present invention;

FIG. 5 is a flowchart of resource fault handling according to a preferred embodiment of the present invention;

FIG. 6 is a flowchart of resource switchback according to a preferred embodiment of the present invention.

Detailed description

The present invention is described in detail below with reference to the drawings and in combination with embodiments. Provided there is no conflict, the embodiments in this application and the features in those embodiments may be combined with one another.

This embodiment provides a resource fault handling method. FIG. 1 is a flowchart of the resource fault handling method according to an embodiment of the present invention. As shown in FIG. 1, the process includes the following steps:

Step S102: monitor whether a specified resource of a node in the network storage cluster system has failed, where the specified resource is a resource corresponding to a specified resource type among the resource types pre-divided in the network storage cluster system;

Step S104: when the specified resource fails, select a target object to take over the specified resource according to a preset policy.

Through the above steps, the resources on a node are classified, and when a specified resource of one of the classified types fails, only the failed specified resource is moved to another node. This solves the problem in the related art that, although resource faults on a node are in many cases only partial faults, the node is still isolated and its services are moved to other takeover nodes, which makes the takeover process complex and error-prone and increases the load on the takeover node. The takeover process is simplified, the error rate is lowered, and the load on the takeover node is reduced. In other words, with the technical solution of this embodiment the takeover node takes over only the faulty part of the resources. Because the node where the fault occurred is not isolated, the resource must be prevented from being loaded on more than one node at the same time, so that services stay consistent and continue to be provided externally.

Optionally, step S102 can be implemented in several ways. In one example of this embodiment, it is implemented as follows: divide the resources of all nodes in the network storage cluster system by resource type; configure the resources of the same resource type on all the nodes as one service group; and determine whether the specified resource has failed by checking the status of the specified resource in the service group. That is, resources of the same resource type on all nodes in the network storage cluster system are logically placed in one service group, and the resources in that group are checked for faults. Because each service group corresponds to exactly one resource type, the fault type can be identified quickly and the resources are easier to manage.
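The grouping described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the `Resource` record, its `healthy` flag, and the type labels `virtual_nic` and `virtual_disk` are assumptions introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    node: str            # node that hosts the resource
    resource_type: str   # hypothetical labels, e.g. "virtual_nic", "virtual_disk"
    name: str
    healthy: bool = True  # assumed status flag reported by monitoring


def build_service_groups(resources):
    """Place resources of the same type, from all nodes, into one service group."""
    groups = defaultdict(list)
    for res in resources:
        groups[res.resource_type].append(res)
    return dict(groups)


def failed_resources(groups, resource_type):
    """Return the resources in one service group whose status reports a fault."""
    return [res for res in groups.get(resource_type, []) if not res.healthy]
```

Because a service group holds a single resource type, a fault found while scanning a group already identifies the kind of resource that has to be taken over.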

Every physical network port has two states: running (ACTIVE) and standby (STANDBY). When the state of the physical network port of the specified resource changes from running to standby, the specified resource can be judged to have failed.
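As a small sketch of this rule, the check below flags a fault only on the ACTIVE to STANDBY transition. The `PortState` enum and the function name are illustrative assumptions; the source only names the two states.

```python
from enum import Enum


class PortState(Enum):
    ACTIVE = "ACTIVE"    # running state
    STANDBY = "STANDBY"  # standby state


def is_fault_transition(previous, current):
    """A fault is inferred only when a port drops from ACTIVE to STANDBY."""
    return previous is PortState.ACTIVE and current is PortState.STANDBY
```

A port that was already in STANDBY, or one that moves from STANDBY to ACTIVE, is not treated as a fault under this rule.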

In another optional embodiment, step S104 can be implemented as follows: select, in the service group to which the specified resource belongs, a service unit to take over the specified resource, and take the node on which that service unit resides as the target object. While the resources in a service group are being monitored and a fault is detected in the specified resource, the same service group can be searched for a service unit that corresponds to a resource of the same resource type as the failed resource. Once the service unit is determined, the node on which it resides is the target object (which can also be understood as the takeover node).

To keep node services in the system consistent after the target object takes over the specified resource, this embodiment also provides the following technical solution: after the target object takes over the specified resource, save the switchover information for the specified fault, where the switchover information includes at least one of the following: information about the original node on which the specified resource resided, and the resource type of the specified resource. When the original node recovers from the fault, the specified resource is switched back to the original node according to the switchover information.
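The bookkeeping for switchover information might look like the sketch below. The `SwitchRecord` fields mirror the two items the text lists (original node and resource type); the `resource_id` key and the class and method names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SwitchRecord:
    resource_id: str    # hypothetical identifier for the taken-over resource
    resource_type: str  # resource type of the specified resource
    origin_node: str    # node the resource ran on before the fault


class SwitchLog:
    """Keeps switchover records so resources can be switched back later."""

    def __init__(self):
        self._records = {}

    def record_takeover(self, resource_id, resource_type, origin_node):
        self._records[resource_id] = SwitchRecord(resource_id, resource_type, origin_node)

    def resources_to_switch_back(self, recovered_node):
        """Remove and return the records whose original node has recovered."""
        due = [r for r in self._records.values() if r.origin_node == recovered_node]
        for rec in due:
            del self._records[rec.resource_id]
        return due
```

Removing a record once it has been returned keeps a recovered node from triggering the same switchback twice.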

In summary, the embodiments of the present invention provide a high-availability mechanism for network-attached storage clusters. It solves partial-failure problems of current network-attached storage such as data loss on running nodes, high network load, and resources being loaded on more than one node.

To make the above resource fault handling process easier to understand, a preferred embodiment is described below. It does not limit the embodiments of the present invention.

First, the terms used in the preferred embodiment are briefly explained:

Service instance: the basic unit of a protected resource (that is, a resource in the service group described above). In a network-attached storage cluster it corresponds to the set of network virtual network ports and virtual disk objects. Take the virtual network port as an example. A virtual network port is an abstraction that aggregates the physical network ports currently providing network connectivity, and it is unique across the whole cluster. The virtual network port is bound to a physical network port in the ACTIVE state, and that physical port carries all external services of the virtual network port. When the ACTIVE physical network port becomes abnormal, a target object is elected from the set of STANDBY protected resources according to the configuration policy and takes over, so that the external services of the virtual network port are not interrupted.

Service unit: a fully functional entity deployed on each node in the cluster that can take on service instance assignments. Each node in the storage cluster system hosts a service unit made up of two service instances, a front-end network port and a back-end virtual disk object. Assuming the network-attached storage cluster system has N nodes, one service unit can take on at most N ACTIVE service instance assignments and N STANDBY service instance assignments.

Service group: a set of objects of the same resource type on one or more service units. Conversely, the specific objects drawn from multiple service groups make up a service unit. Taking the virtual network port as an example, all the physical network ports that carry the services of a virtual network port form that virtual network port's service group. Each service group has its own active/standby policy, and service groups are fully independent of one another. Each service group has a unique identifier that is specified at creation and is unique within the network-attached storage cluster system.

Home node: specified when a front-end or back-end virtual storage resource is created. A virtual resource can belong to only one node. At power-on, the service unit object on the home node is preferred for the ACTIVE service instance assignment.

Configuration policy: specified when a front-end or back-end virtual resource is created. When the resource becomes abnormal, a service unit object is selected according to this policy to take over. By default, the service unit with the smaller IP address value takes over first. An interface is also provided for manual intervention: service unit objects can be given different weights, and the one with the larger weight takes over the failed resource first.

Takeover node (the target object in the embodiments above): when the ACTIVE service unit of a front-end or back-end resource becomes abnormal, an election is initiated among the STANDBY nodes according to the configuration policy, producing a new ACTIVE service unit object. The node on which that service unit object resides is called the takeover node.

Main decision node: the node hosting the ACTIVE service instance elected when the fault management module powers on. When the fault management module itself becomes abnormal, a new election is initiated, producing a new ACTIVE fault management service instance assignment. The node hosting the new service instance becomes the new main decision node.
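The configuration policy defined in the terms above (smallest IP address by default, largest weight when weights are configured) can be sketched as follows. The `ServiceUnit` record is hypothetical, and two details are assumptions the source does not specify: units without a weight are ignored once any weight is configured, and equal weights fall back to the smaller IP address.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServiceUnit:
    node_ip: str                  # IP address of the node hosting the unit
    weight: Optional[int] = None  # set only when an operator intervenes


def _ip_value(unit):
    # Compare addresses numerically, so 10.0.0.9 ranks below 10.0.0.10.
    return int(ipaddress.ip_address(unit.node_ip))


def select_takeover_unit(standby_units):
    """Pick the STANDBY unit that should take over a failed resource."""
    if not standby_units:
        return None
    weighted = [u for u in standby_units if u.weight is not None]
    if weighted:
        # Manual weights win; ties fall back to the smaller IP (assumption).
        return max(weighted, key=lambda u: (u.weight, -_ip_value(u)))
    return min(standby_units, key=_ip_value)
```

Comparing the numeric value rather than the string matters: as strings, "10.0.0.10" sorts before "10.0.0.9", which would pick the wrong unit.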

The technical solution of the preferred embodiment can be summarized as follows: by defining a protected resource model and a fault management framework, it manages the front-end network resources and back-end storage resources of network-attached storage so that the resources of the whole storage cluster are highly available.

When some front-end or back-end resources become abnormal, heartbeat monitoring watches the protected resources for partial resource faults. Once the monitoring module detects an abnormality, it raises an alarm to the fault management module. On receiving the alarm, the fault management module decides, by protected-resource takeover priority, which resources need to be taken over and performs the takeover, keeping external services continuous. It also records the switchover information of the abnormal resource.

Optionally, once the fault is cleared, the state of the faulty module is synchronized automatically to the protection resource group. The monitoring module detects the recovery and sends a fault recovery request to the fault management module, which performs the corresponding switchback according to the switchover information of the abnormal resource.

In the technical solution above, the resource protection group model can be described roughly as follows. A monitoring module resides on every node and is responsible for heartbeat monitoring and, when an abnormality occurs, for the election within the service group according to the configuration policy. The module runs on each node as a daemon thread. The node that powers on first is the main decision node. If several nodes power on at the same time, their IP addresses are compared and the node with the smaller IP address value is elected main decision node. Nodes communicate through Remote Procedure Call (RPC) protocol messages. Normally the main decision node initiates the heartbeat check and collects the status of the service units on the other nodes by service group identifier. The other nodes decide whether to send a new beacon and start a new election based on at least one of the following events: 1. the periodic heartbeat check has exceeded the maximum check time; 2. the service unit currently in the ACTIVE state has become abnormal. When either condition is met, a node sends a beacon to all nodes in the cluster to start the election of the ACTIVE service unit.
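The two rules in this paragraph, who becomes the main decision node and when a node starts a new election, can be sketched as below. The tuple layout `(ip, power_on_time)`, the parameter names, and the use of seconds are assumptions for the example; the RPC transport and the beacon itself are left out.

```python
import ipaddress


def elect_main_decision_node(nodes):
    """nodes: iterable of (ip, power_on_time) pairs.

    The earliest power-on time wins; nodes that powered on at the same
    time are separated by the smaller numeric IP address value.
    """
    winner = min(nodes, key=lambda n: (n[1], int(ipaddress.ip_address(n[0]))))
    return winner[0]


def should_start_election(seconds_since_heartbeat, max_check_seconds, active_unit_abnormal):
    """Start an election if the heartbeat check timed out or the ACTIVE unit is abnormal."""
    return seconds_since_heartbeat > max_check_seconds or active_unit_abnormal
```

Either condition alone is enough to trigger an election, which matches the "at least one of the following events" wording.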

The fault management module on the main decision node, elected by the fault management service identifier, manages the front-end and back-end resources of the entire storage system. For each front-end or back-end resource, the ACTIVE service unit performs the work of the service instance and all services are carried on that service instance, while the other service units are in the STANDBY state for that service instance. After an abnormality of the ACTIVE service unit is detected, the fault management module coordinates the entire takeover through the following steps:

Step 1: configure the virtual network port and virtual disk shared storage service groups on each node. The front-end virtual network service group is used for user storage network access, and the back-end virtual disk storage service group holds the shared storage data resources;

Step 2: assign a home node to every virtual resource and register the configured resources in the resource service units. Under normal conditions a virtual resource actually runs in the service unit on its home node, and that service unit is in the ACTIVE state;

Step 3: the monitoring module performs real-time heartbeat monitoring of all resources in the resource protection groups and raises an alarm as soon as a running resource in a protection resource group becomes abnormal;

Step 4: the fault management module receives the alarm and takes offline the service unit resource in the service group that is currently running abnormally;

Step 5: based on the current node and the service group identifier, select the target takeover service unit object according to the configuration policy, migrate the resource to it, save a record of the migration, and set the new service unit to the ACTIVE state;

Step 6: when the abnormal front-end or back-end resource returns to normal, the resource service group is updated automatically and the fault management module is notified;

Step 7: the fault management module switches the running resource back to the recovered service unit according to the migration record saved at the time of the fault. Recovery is then complete, and the states of both service unit objects are adjusted at the same time.
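Steps 2, 4, 5, and 7 above can be tied together in a small state sketch. It is illustrative only: the `ServiceGroup` class, the `OFFLINE` label for a unit taken out of service, and the default smallest-IP selection without weights are assumptions, and monitoring and notification (steps 3 and 6) are represented simply by calling the two handler methods.

```python
import ipaddress


class ServiceGroup:
    """Tracks which node's service unit is ACTIVE for one virtual resource."""

    def __init__(self, group_id, home_node, node_ips):
        self.group_id = group_id
        # Step 2: the unit on the home node starts ACTIVE, the others STANDBY.
        self.state = {ip: ("ACTIVE" if ip == home_node else "STANDBY") for ip in node_ips}
        self.migration = None  # (origin, target) record kept for the switchback

    def active_node(self):
        return next((ip for ip, s in self.state.items() if s == "ACTIVE"), None)

    def handle_fault(self):
        """Steps 4 and 5: take the faulty unit offline and promote a STANDBY unit."""
        origin = self.active_node()
        if origin is None:
            return None
        self.state[origin] = "OFFLINE"  # assumed label for a unit taken offline
        standby = [ip for ip, s in self.state.items() if s == "STANDBY"]
        if not standby:
            return None
        # Default policy: the smaller numeric IP address takes over first.
        target = min(standby, key=lambda ip: int(ipaddress.ip_address(ip)))
        self.state[target] = "ACTIVE"
        self.migration = (origin, target)
        return target

    def handle_recovery(self):
        """Step 7: switch back using the saved migration record."""
        if self.migration is None:
            return None
        origin, target = self.migration
        self.state[target] = "STANDBY"
        self.state[origin] = "ACTIVE"
        self.migration = None
        return origin
```

For example, a group created with home node 10.0.0.2 over nodes 10.0.0.1 to 10.0.0.3 would promote 10.0.0.1 on a fault and return to 10.0.0.2 on recovery. Only one unit is ACTIVE at any point, which is the property the text relies on to keep a resource from being loaded on more than one node.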

The preferred embodiment achieves the following technical effects. Through the resource protection group model, cluster nodes are broken down into front-end network resources and back-end storage resources. When only part of a node's resources is abnormal, only the abnormal part is taken over and the normally running part of the node is kept. This improves overall performance and makes effective use of the resources of the network-attached storage cluster. It meets the high availability, stability, and scalability requirements of critical services and can be used for the fault detection, takeover decisions, fault isolation and switchover, recovery, and expansion required by multi-machine hot standby in high-availability storage clusters. By improving the Paxos algorithm to support multi-instance elections keyed by node and service group identifier, elections become more flexible. The fault management module itself joins a protection resource group for hot standby, which simplifies the system implementation and effectively handles a fault in the fault management module on the main decision node. Deploying hot standby hosts inside the cluster system makes full use of the hosts' own computing power, speeds up the takeover response, and lowers cost.

This embodiment also provides a resource fault handling device, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may refer to a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.

Fig. 2 is a structural block diagram of a resource fault handling device according to an embodiment of the present invention. As shown in Fig. 2, the device includes:

a monitoring module 20, configured to monitor whether a specified resource of a node in a network storage cluster system has failed, where the specified resource is the resource corresponding to a specified resource type among the resource types pre-divided in the network storage cluster system;

a selection module 22, connected to the monitoring module 20 and configured to select, according to a preset policy, a target object to take over the specified resource when the specified resource fails.

Through the combined action of the above modules, the resources on a node are classified, and when a specified resource of one of the resulting types fails, only the failed specified resource is transferred to another node. This solves the problem in the related art that, although resource faults on a node are in many cases only partial faults, the node is nevertheless isolated and its services are transferred to other takeover nodes, which makes the takeover process complicated and error-prone and also increases the load on the takeover nodes. The technical solution simplifies the takeover process, lowers the error rate, and reduces the load on the takeover node.

Fig. 3 is another structural block diagram of a resource fault handling device according to an embodiment of the present invention. As shown in Fig. 3, in addition to all the modules shown in Fig. 2, the device further includes the following units.

To implement the above function of monitoring whether a specified resource of a node in the network storage cluster system has failed, in an optional embodiment of the present invention the monitoring module 20 may include the following units: a dividing unit 200, configured to divide the resources of all nodes in the network storage cluster system by resource type; a configuration unit 202, connected to the dividing unit 200 and configured to configure the resources of the same resource type across all the nodes as one service group; and a judging unit 204, connected to the configuration unit 202 and configured to judge whether the specified resource has failed by detecting the state of the specified resource in the service group. The judging unit 204 determines that the specified resource has failed when the state of the physical network port of the specified resource changes from the running state to the standby state.
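The division, configuration, and judgment performed by units 200, 202, and 204 can be illustrated with a minimal Python sketch. The `Resource` type, the state constants, and the function names are hypothetical and chosen only for illustration; the patent does not prescribe any particular data structure.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical state labels for a physical network port.
RUNNING = "running"
STANDBY = "standby"


@dataclass(frozen=True)
class Resource:
    node: str   # node that hosts the resource
    rtype: str  # resource type, e.g. "virtual_port" or "virtual_disk"
    name: str


def build_service_groups(resources):
    """Group resources of the same type, across all nodes, into one service group."""
    groups = defaultdict(list)
    for res in resources:
        groups[res.rtype].append(res)
    return dict(groups)


def has_failed(previous_state, current_state):
    """A resource is judged failed when its port goes from running to standby."""
    return previous_state == RUNNING and current_state == STANDBY


resources = [
    Resource("node1", "virtual_port", "vport1"),
    Resource("node2", "virtual_port", "vport2"),
    Resource("node1", "virtual_disk", "vdisk1"),
    Resource("node2", "virtual_disk", "vdisk2"),
]
groups = build_service_groups(resources)
```

In this sketch each resource type yields exactly one service group, so a fault on `vport1` concerns only the `virtual_port` group and leaves the `virtual_disk` resources of the same node untouched.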

Optionally, the selection module 22 may further include the following units: a selection unit 220, configured to select, in the service group to which the specified resource belongs, a service unit to take over the specified resource; and a determination unit 222, connected to the selection unit 220 and configured to take the node on which that service unit is located as the target object.

In the embodiment of the present invention, the target object selected by the selection module 22 can be understood as the takeover node of the above embodiments.
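A minimal Python sketch of the selection and determination units follows. The claims name two selection policies, a preset priority and the IP address value of the service unit, but do not say which end of the ordering wins; the sketch assumes that the lowest priority number and the numerically lowest IP address are preferred. The `ServiceUnit` type and all function names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceUnit:
    node: str
    ip: str
    priority: int        # assumption: a smaller number means a higher priority
    healthy: bool = True


def _candidates(units, failed_node):
    """Units that may take over: healthy and not on the failed node."""
    return [u for u in units if u.healthy and u.node != failed_node]


def select_by_priority(units, failed_node):
    cands = _candidates(units, failed_node)
    return min(cands, key=lambda u: u.priority) if cands else None


def select_by_ip(units, failed_node):
    # Compare addresses numerically, so 10.0.0.9 sorts before 10.0.0.10.
    cands = _candidates(units, failed_node)
    return min(cands, key=lambda u: int(ipaddress.ip_address(u.ip))) if cands else None


def target_node(unit):
    """The target object is the node on which the selected service unit runs."""
    return unit.node if unit is not None else None


units = [
    ServiceUnit("node1", "10.0.0.11", priority=2),
    ServiceUnit("node2", "10.0.0.9", priority=3),
    ServiceUnit("node3", "10.0.0.10", priority=1),
]
```

With `node3` as the failed node, the priority policy picks the unit on `node1` and the IP policy picks the unit on `node2`, which shows that the two policies can disagree and that the choice of policy is a configuration decision.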

The technical solutions of the embodiments of the present invention are described in further detail with reference to the following preferred embodiments:

Fig. 4 is a schematic diagram of a resource protection group model according to a preferred embodiment of the present invention. As shown in Fig. 4, there are two service groups, a virtual network port service group and a virtual disk service group, and two service instances, a virtual network port service instance and a virtual disk service instance. The virtual network port service instance is protected and executed by the virtual network port service group, and the virtual disk service instance is protected and executed by the virtual disk service group. A solid arrow points to the ACTIVE service unit object, which actually carries the service; a dashed arrow points to a STANDBY service unit object, from which a new ACTIVE unit is designated to take over when an exception occurs.

As can be seen from Fig. 4, within the virtual network port service group, service unit 3 performs the ACTIVE role for the virtual network port service instance, and service units 1 and 2 perform the STANDBY role for it. In Fig. 4, a solid line connecting a service instance to a service unit denotes an ACTIVE assignment, and a dashed line denotes a STANDBY assignment.

Within the virtual disk service group, service unit 2 performs the ACTIVE role for the virtual disk service instance, and service units 1 and 3 perform the STANDBY role for it.
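The ACTIVE and STANDBY assignments of Fig. 4 can be written down as a small Python model. This is only an illustrative sketch: the `ServiceGroup` class and its `promote` method are hypothetical names, and the role table is the only part taken from the figure description.

```python
from dataclasses import dataclass

ACTIVE = "ACTIVE"
STANDBY = "STANDBY"


@dataclass
class ServiceGroup:
    name: str
    roles: dict  # service unit id -> ACTIVE or STANDBY

    def active_unit(self):
        """Return the unit that currently carries the service, if any."""
        for unit, role in self.roles.items():
            if role == ACTIVE:
                return unit
        return None

    def standby_units(self):
        return sorted(u for u, r in self.roles.items() if r == STANDBY)

    def promote(self, unit):
        """Designate a STANDBY unit as the new ACTIVE unit; demote the old one."""
        if self.roles.get(unit) != STANDBY:
            raise ValueError(f"unit {unit} is not a STANDBY unit of {self.name}")
        current = self.active_unit()
        if current is not None:
            self.roles[current] = STANDBY
        self.roles[unit] = ACTIVE


# Assignments as described for Fig. 4.
port_group = ServiceGroup("virtual_port", {1: STANDBY, 2: STANDBY, 3: ACTIVE})
disk_group = ServiceGroup("virtual_disk", {1: STANDBY, 2: ACTIVE, 3: STANDBY})
```

Because each service group keeps its own role table, promoting a standby unit in the virtual network port group does not change which unit is ACTIVE in the virtual disk group.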

Fig. 5 is a flowchart of resource fault handling according to a preferred embodiment of the present invention. As shown in Fig. 5:

In the scenario where part of a node's resources is abnormal, the complete takeover flow triggered by a resource fault is as follows:

Step S502: The state of a service protection resource on the resource's home node changes (triggered by a device fault or a man-machine command) from ACTIVE to STANDBY, and the monitoring agent module on that node is notified;

Step S504: The monitoring module on the master decision node communicates with the monitoring agent of each node through periodic heartbeats, detects that the state of a protection resource of the corresponding type is abnormal, and sends a switch request to the fault management module on its own node;

Step S506: The fault management module notifies the agent module on the home node of the abnormal resource to take the affected resource offline; the agent module performs the resource offline operation, cleans up the resource, and then returns a resource offline response to the fault management module on the master decision node;

Step S508: The fault management module on the master decision node receives the resource offline response, elects a takeover node for the abnormal resource according to the configured policy, and sends a resource online request to the agent module on the takeover node;

Step S510: The agent module on the target node receives the resource online request, performs the resource online operation toward the service module, and then returns a resource online response to the fault management module on the master decision node;

Step S512: The fault management module on the master decision node receives the resource online response, considers the switch complete, and returns a switch response to the monitoring module on its own node; the flow ends.
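The ordering of steps S506 to S512 can be sketched in Python as follows. The sketch is a single-process simulation under stated assumptions: the `Cluster` class, the `take_over` function, and the default election rule (`min` over node names) are hypothetical and merely stand in for the agent modules, the fault management module, and the configured policy. The one property it is meant to show is that the resource goes offline on the home node before it goes online on the takeover node, so it is never loaded on two nodes at once.

```python
class Cluster:
    """Tracks which node currently serves each resource (None while offline)."""

    def __init__(self, owners):
        self.owners = dict(owners)
        self.log = []

    def offline(self, resource, node):
        if self.owners.get(resource) != node:
            raise RuntimeError(f"{resource} is not loaded on {node}")
        self.owners[resource] = None
        self.log.append(("offline", resource, node))

    def online(self, resource, node):
        if self.owners.get(resource) is not None:
            raise RuntimeError(f"{resource} is still loaded on another node")
        self.owners[resource] = node
        self.log.append(("online", resource, node))


def take_over(cluster, resource, failed_node, nodes, elect=min):
    """Run S506-S512 for one resource and return a migration record."""
    cluster.offline(resource, failed_node)                 # S506: offline and clean up
    candidates = [n for n in nodes if n != failed_node]
    if not candidates:
        raise RuntimeError("no node available to take over")
    takeover = elect(candidates)                           # S508: elect the takeover node
    cluster.online(resource, takeover)                     # S510: bring the resource online
    # S512: switch complete; the record is what a later switchback would consult.
    return {"resource": resource, "origin": failed_node, "takeover": takeover}


cluster = Cluster({"vport0": "node3", "vdisk0": "node3"})
record = take_over(cluster, "vport0", "node3", ["node1", "node2", "node3"])
```

Only `vport0` moves in this example; `vdisk0` stays on `node3`, which mirrors the idea that the healthy part of a node keeps running while only the failed resource is taken over.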

Fig. 6 is a flowchart of resource switchback according to a preferred embodiment of the present invention. As shown in Fig. 6:

In the scenario where part of a node's resources recovers from an abnormal state, the complete switchback flow triggered by the recovery is as follows:

Step S602: The state of a service protection resource on the resource's home node changes (triggered by recovery from a device fault or by a man-machine command) from STANDBY to ACTIVE, and the monitoring agent module on that node is notified;

Step S604: The monitoring module on the master decision node communicates with the monitoring agent of each node through periodic heartbeats, detects that the state of an active protection resource of the corresponding type has recovered, and sends a switch request to the fault management module on its own node;

Step S606: The fault management module notifies the agent module on the takeover node to take the resource offline; after cleaning up the resource, the agent module returns a resource offline response to the fault management module on the master decision node;

Step S608: The fault management module on the master decision node receives the resource offline response and sends a resource online request to the agent module on the original home node;

Step S610: The agent module on the resource's home node receives the resource online request, performs the resource online operation toward the service module, and then returns a resource online response to the fault management module on the master decision node;

Step S612: The fault management module on the master decision node receives the resource online response, considers the switch complete, and returns a switchback response to the monitoring module on its own node; the flow ends.
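Steps S606 to S612 can be sketched in Python as the reverse of the takeover, driven by a migration record saved at takeover time. The record layout (`resource`, `origin`, `takeover`) and the function name `switch_back` are assumptions made for illustration; the source only states that the record holds at least the original node information or the resource type.

```python
def switch_back(owners, record):
    """Move a resource from the takeover node back to its original home node.

    owners maps each resource to the node that currently serves it.
    record is a hypothetical migration record saved when the fault occurred.
    """
    resource = record["resource"]
    origin = record["origin"]
    takeover = record["takeover"]
    if owners.get(resource) != takeover:
        raise RuntimeError(f"{resource} is not loaded on {takeover}")

    steps = []
    owners[resource] = None              # S606: offline and clean up on the takeover node
    steps.append(("offline", takeover))
    owners[resource] = origin            # S608-S610: online on the original home node
    steps.append(("online", origin))
    return steps                         # S612: switchback complete


owners = {"vport0": "node1", "vdisk0": "node3"}
record = {"resource": "vport0", "origin": "node3", "takeover": "node1"}
steps = switch_back(owners, record)
```

As in the takeover direction, the resource is taken offline on one node before it is brought online on the other, so it is never served from two nodes at the same time.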

From the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation.

In another embodiment, software is also provided, which is used to execute the technical solutions described in the above embodiments and preferred implementations.

In another embodiment, a storage medium is also provided, in which the above software is stored. The storage medium includes, but is not limited to, an optical disc, a floppy disk, a hard disk, and a rewritable memory.

It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that objects so described are interchangeable under appropriate circumstances, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.

In summary, the embodiments of the present invention achieve the following technical effects: the takeover process is simplified, the error rate is lowered, and the load on the takeover node is reduced. That is, with the technical solution of the embodiments of the present invention, the takeover node takes over only the faulty part of the resources. Because the node on which the fault occurred is not isolated, the resource must be prevented from being loaded on more than one node at the same time, so that service consistency is ensured and service continues to be provided externally.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented by general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described can be executed in an order different from the one given here. Alternatively, they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.

The above are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A fault handling method for a resource, characterized by comprising:
monitoring whether a specified resource of a node in a network storage cluster system has failed, wherein the specified resource is the resource corresponding to a specified resource type among resource types pre-divided in the network storage cluster system;
when the specified resource fails, selecting, according to a preset policy, a target object to take over the specified resource.
2. The method according to claim 1, characterized in that monitoring whether the specified resource of the node in the network storage cluster system has failed comprises:
dividing the resources of all nodes in the network storage cluster system by resource type;
configuring the resources of the same resource type in all the nodes as one service group;
judging whether the specified resource has failed by detecting the state of the specified resource in the service group.
3. The method according to claim 2, characterized in that the specified resource is determined to have failed in the following case:
when the state of the physical network port of the specified resource changes from the running state to the standby state, the specified resource is determined to have failed.
4. The method according to claim 2, characterized in that selecting, according to the preset policy, the target object to take over the specified resource comprises:
selecting, in the service group to which the specified resource belongs, a service unit to take over the specified resource;
taking the node on which the service unit is located as the target object.
5. The method according to claim 4, characterized in that the service unit to take over the specified resource is selected in the service group to which the resource belongs in one of the following ways:
selecting the service unit from the service group according to a preset priority;
selecting the service unit according to the IP address values of the service units in the service group.
6. The method according to any one of claims 1 to 5, characterized in that, after the target object takes over the failed specified resource, the method further comprises:
saving the switching information of the specified fault, wherein the switching information comprises at least one of the following: information on the original node on which the specified resource was located, and the resource type corresponding to the specified resource;
when the fault of the original node on which the specified resource was located is recovered, switching the specified resource back to the original node according to the switching information.
7. A fault handling device for a resource, characterized by comprising:
a monitoring module, configured to monitor whether a specified resource of a node in a network storage cluster system has failed, wherein the specified resource is the resource corresponding to a specified resource type among resource types pre-divided in the network storage cluster system;
a selection module, configured to select, according to a preset policy, a target object to take over the specified resource when the specified resource fails.
8. The device according to claim 7, characterized in that the monitoring module comprises:
a dividing unit, configured to divide the resources of all nodes in the network storage cluster system by resource type;
a configuration unit, configured to configure the resources of the same resource type in all the nodes as one service group;
a judging unit, configured to judge whether the specified resource has failed by detecting the state of the specified resource in the service group.
9. The device according to claim 8, characterized in that the judging unit is configured to determine that the specified resource has failed when the state of the physical network port of the specified resource changes from the running state to the standby state.
10. The device according to claim 8, characterized in that the selection module comprises:
a selection unit, configured to select, in the service group to which the specified resource belongs, a service unit to take over the specified resource;
a determination unit, configured to take the node on which the service unit is located as the target object.
CN201410545516.4A 2014-10-15 2014-10-15 Fault processing method of resources and device Pending CN105515812A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201410545516.4A CN105515812A (en) 2014-10-15 2014-10-15 Fault processing method of resources and device
PCT/CN2015/072923 WO2016058307A1 (en) 2014-10-15 2015-02-12 Fault handling method and apparatus for resource

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410545516.4A CN105515812A (en) 2014-10-15 2014-10-15 Fault processing method of resources and device

Publications (1)

Publication Number Publication Date
CN105515812A true CN105515812A (en) 2016-04-20

Family

ID=55723475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410545516.4A Pending CN105515812A (en) 2014-10-15 2014-10-15 Fault processing method of resources and device

Country Status (2)

Country Link
CN (1) CN105515812A (en)
WO (1) WO2016058307A1 (en)

Also Published As

Publication number Publication date
WO2016058307A1 (en) 2016-04-21

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160420

WD01 Invention patent application deemed withdrawn after publication