CN104270286B

CN104270286B - SDN network node fault location method

Info

Publication number: CN104270286B
Application number: CN201410483842.7A
Authority: CN
Inventors: 赵永利; 杨辉; 崔雅迪; 张杰; 高冠军
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2014-09-19
Filing date: 2014-09-19
Publication date: 2018-10-16
Anticipated expiration: 2034-09-19
Also published as: CN104270286A

Abstract

To enable automatic recovery when a disconnection fault occurs in an SDN network, the present invention proposes a method by which the node plane locates the disconnection fault, so that the network can recover according to the location result. The method includes the following steps: when the keep-alive message disappears, a node determines that a disconnection fault has occurred; if a neighbor node's state is normal, the controller starts a node replacement algorithm and issues a replacement instruction to a replacement node. If the neighbor node is also disconnected, inquiry messages continue to be sent to other nodes until the disconnection range is determined, and for each disconnected node within that range the controller issues a replacement instruction to a replacement node. If the service state of a disconnected node is abnormal, its existing services are migrated. If every node in the network is disconnected, the controller is judged to be damaged and a controller recovery method is started. The scheme implements disconnection fault location at the node layer and applies the corresponding recovery strategy, and it is fast.

Description

An SDN network node fault location method

Technical field

The present invention relates to the technical field of SDN networks, and in particular to a fault location method performed by the node plane, and the selection of a corresponding recovery method, when a disconnection fault occurs between the controller and a node.

Background

In recent years SDN has become a research hotspot; its inherent separation of control and forwarding has opened new directions for network development and research.

An SDN network consists mainly of a control layer and a node layer. The control layer manages the resources and services of the entire network, while the node layer forms the network topology and carries the network's service traffic. The two layers communicate through the OpenFlow protocol.

The disconnection fault addressed by the present invention is a condition in which keep-alive messages and OpenFlow protocol communication can no longer be exchanged normally between the controller and a node. When such a fault occurs in an SDN network, how the fault is located, and which recovery strategy is then adopted, affects how well the network recovers.

In the SDN architecture described above, a disconnection fault may arise in the node plane, in the control plane, or in the communication network between the two. How to exploit the characteristics and architecture of the SDN network so that the node plane itself locates the fault as quickly as possible, and different recovery strategies are chosen accordingly, is therefore an important factor in whether the SDN network is sufficiently robust.

Summary of the invention

To solve the problem of automatic recovery when a disconnection fault occurs in an SDN network, the present invention proposes a method for locating disconnection faults through the node plane, so that the SDN network can recover according to the location result.

The SDN network node fault location method of the present invention comprises the following steps:

An SDN network node determines whether it is disconnected by means of the keep-alive messages exchanged between the node and the controller; when the keep-alive messages disappear, the node determines that a disconnection fault has occurred;

The disconnected node sends an inquiry message to its neighbor nodes;

If the state of the neighbor node is normal, it is determined that there is a network fault between the disconnected node and the controller, or that the disconnected node itself has failed; the disconnected node sends a replacement request to the controller through the normal neighbor node; the controller starts the node replacement algorithm and issues a replacement instruction to the replacement node.

If the neighbor node is also disconnected, the neighbor node continues to send inquiry messages to other nodes until the disconnection range is determined; each disconnected node sends a replacement request to the controller through a normal neighbor node; for each disconnected node, the controller starts the node replacement algorithm and issues a replacement instruction to the replacement node.

Further, the disconnected node queries the state of its own existing services; if the service state of the disconnected node is abnormal, the controller additionally starts a service migration algorithm to migrate the existing services of the disconnected node.

Further, when the disconnection range is determined, if every node in the network is disconnected from the controller, the controller is judged to be damaged and a conventional controller recovery method is started.

This scheme makes full use of the architectural and communication characteristics of SDN itself. Starting from the node plane and without introducing any additional facilities, the SDN network itself implements the location of disconnection faults, so that when such a fault occurs the network can locate it as quickly as possible and adopt the corresponding recovery strategy, with low overhead and low delay.

Description of the drawings

Figure 1 is an architecture diagram of the node-plane-based fault location method.

Figure 2 shows an embodiment of the node-plane-based fault location method.

Detailed description

Figure 1 is an architecture diagram of the node-plane-based fault location method. It shows an example single-domain SDN network consisting of one controller and six nodes. The six nodes are interconnected to form the "node plane"; the thin solid black lines between nodes represent their interconnections, which together form the intra-domain topology. A dashed line between the controller and a node indicates that keep-alive messages are exchanged between them. A dashed line marked with a cross indicates that the keep-alive messages between the controller and that node have failed; here node 6 becomes the disconnected node. The arrows from node 6 to its neighbor nodes indicate that, once node 6 detects that it is disconnected, it sends inquiry messages to its neighbors. The arrow from the controller to node 3 indicates that the controller, after running the node replacement algorithm, has selected node 3 as the replacement for node 6 and issues a replacement instruction to it. The dashed line through node 2, node 6, and node 4 represents a service path that existed before node 6 was disconnected; this path is retained when node 6 is merely disconnected and its service state is normal. If node 6 is both disconnected and suffering a service failure, the service path must be restored. For example, the thick solid line through node 2, node 3, and node 4 represents the restored service path computed jointly by the controller's node replacement algorithm and service migration algorithm.
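The rerouting described for Figure 1 can be illustrated with a short Python sketch. The patent does not specify the node replacement or service migration algorithms, so a plain breadth-first search is substituted here purely for illustration. The function name `restore_path` and the adjacency list `TOPOLOGY` are assumptions; the text does not list the exact links of the figure.

```python
from collections import deque

# Assumed six-node topology; the real links in Figure 1 are not given in the text.
TOPOLOGY = {
    1: [2, 5],
    2: [1, 3, 6],
    3: [2, 4, 6],
    4: [3, 6],
    5: [1, 6],
    6: [2, 3, 4, 5],
}


def restore_path(topology, src, dst, lost):
    """Return a shortest path from src to dst that avoids every node in `lost`.

    Returns None when no such path exists. This is ordinary BFS, standing in
    for the unspecified node replacement / service migration algorithms.
    """
    prev = {src: None}          # predecessor map, also marks visited nodes
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for neighbor in topology[node]:
            if neighbor not in lost and neighbor not in prev:
                prev[neighbor] = node
                queue.append(neighbor)
    return None


# The service path 2-6-4 is rerouted around the disconnected node 6.
print(restore_path(TOPOLOGY, 2, 4, lost={6}))  # [2, 3, 4]
```

With this assumed topology the restored path passes through node 3, which matches the replacement node named in the figure description.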

Figure 2 shows an embodiment of the fault location method of the present invention. The whole process includes the following steps:

101: Fault determination. The node determines whether it is in the normal state or the disconnected state from the status of its connection to the control plane. Under normal conditions, keep-alive messages are exchanged between the node and the control plane, and the node acts on the control plane's instructions. When the node cannot receive keep-alive messages normally, or cannot receive control plane instructions, it determines that a disconnection fault as defined by the present invention has occurred in its network.
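Step 101 can be sketched as a timeout check on the last keep-alive received. This is a minimal illustration, not the patented implementation: the class name `KeepAliveMonitor`, the timeout value, and the injectable clock are all assumptions, and only the keep-alive condition is modelled (the "cannot receive control plane instructions" condition is left out).

```python
import time


class KeepAliveMonitor:
    """Hypothetical helper: tracks when the controller's keep-alive was last seen."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s      # assumed threshold; the patent gives no value
        self._clock = clock             # injectable so the logic can be tested
        self._last_seen = clock()

    def on_keepalive(self):
        """Call whenever a keep-alive message arrives from the controller."""
        self._last_seen = self._clock()

    def is_disconnected(self):
        """True once no keep-alive has arrived for longer than timeout_s."""
        return self._clock() - self._last_seen > self.timeout_s


# Usage with a fake clock so the example is deterministic.
now = [0.0]
monitor = KeepAliveMonitor(timeout_s=3.0, clock=lambda: now[0])
now[0] = 2.0
monitor.on_keepalive()
now[0] = 5.5
print(monitor.is_disconnected())  # True: 3.5 s of silence exceeds the 3 s timeout
```

A node that sees `is_disconnected()` return true would then proceed to step 201.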

201: The disconnected node sends an inquiry message to its neighbor nodes.

202: Determine whether the state of the neighbor node is normal.

203: If the neighbor node's state is normal (not disconnected), it is determined that there is a network fault between the disconnected node and the controller, or that the node itself has failed (while the controller is operating normally). The disconnected node then initiates reconnection through the neighbor node and sends a replacement request to the controller. Go to step 301.

204: If the neighbor node is also disconnected, the inquiry continues outward until the disconnection range is determined; then go to step 401.
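Steps 201 to 204 amount to an outward search over the topology that stops at normal neighbors. The sketch below simulates it centrally in Python under stated assumptions: the `connected` dictionary stands in for the replies to inquiry messages, and the function name `disconnection_range` and the topology are hypothetical. In the method itself each node would issue its own inquiries rather than share a global view.

```python
from collections import deque

# Assumed six-node topology, for illustration only.
TOPOLOGY = {
    1: [2, 5],
    2: [1, 3, 6],
    3: [2, 4, 6],
    4: [3, 6],
    5: [1, 6],
    6: [2, 3, 4, 5],
}


def disconnection_range(topology, connected, start):
    """Expand inquiries outward from `start` (a disconnected node).

    Returns (lost, normal_neighbors): the disconnected nodes reached, and the
    normal nodes on the boundary that could relay requests to the controller.
    """
    lost = {start}
    normal_neighbors = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in topology[node]:       # step 201: inquire of each neighbor
            if connected[neighbor]:           # step 203: neighbor is normal
                normal_neighbors.add(neighbor)
            elif neighbor not in lost:        # step 204: keep inquiring outward
                lost.add(neighbor)
                queue.append(neighbor)
    return lost, normal_neighbors


# Nodes 5 and 6 have lost the controller; the others are normal.
connected = {1: True, 2: True, 3: True, 4: True, 5: False, 6: False}
lost, relays = disconnection_range(TOPOLOGY, connected, start=6)
print(sorted(lost), sorted(relays))  # [5, 6] [1, 2, 3, 4]
```

An empty set of normal neighbors means the search found no node that can still reach the controller.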

301: The disconnected node queries the state of its own existing services and reports the result to the controller through the neighbor node.

302: Determine whether the service state is normal. There are two cases:

Case 1: The existing service state is normal; that is, the disconnected node can no longer carry new services, but existing services passing through it are unaffected. Go to step 303.

Case 2: The existing service state is abnormal; that is, the disconnected node can neither carry the existing services passing through it nor accept new services. Go to step 304.

303: The controller internally marks the node as disconnected. If no new service arrives, the controller need take no further action for the time being. When a new service that would pass through the node arrives, the controller triggers the node replacement algorithm, issues a replacement instruction to the computed replacement node, and the replacement node carries the new service.

304: The controller starts the node replacement algorithm and the service migration algorithm, issues the replacement instruction, and migrates the existing services of the disconnected node.
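The branching in steps 302 to 304 can be written as a small decision function. This Python sketch only encodes which actions the controller takes in each case; the function name `controller_actions` and the action labels are assumptions, and the algorithms themselves are not modelled.

```python
def controller_actions(service_ok, new_service_arrives):
    """Return the controller's actions for one disconnected node.

    service_ok:          result reported in step 301 (existing services unaffected?)
    new_service_arrives: whether a new service would pass through the node
    """
    if service_ok:
        # Step 303: mark the node; act further only when a new service arrives.
        actions = ["mark_node_disconnected"]
        if new_service_arrives:
            actions += ["run_node_replacement", "issue_replacement_instruction"]
        return actions
    # Step 304: replacement plus migration of the existing services.
    return [
        "run_node_replacement",
        "run_service_migration",
        "issue_replacement_instruction",
        "migrate_existing_services",
    ]


print(controller_actions(service_ok=True, new_service_arrives=False))
# ['mark_node_disconnected']
```

Keeping the decision separate from the algorithms makes the lazy behaviour of step 303 explicit: replacement is computed only on demand when existing services are still healthy.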

401: Based on the number of disconnected nodes in the network, determine whether all nodes are disconnected. The disconnection range falls into two cases, partial disconnection and full disconnection:

Case 1: Partial disconnection. If the disconnection is judged to be partial, go to step 301. Partial disconnection means that the network still contains nodes that can communicate normally with the control plane. In this case the control plane is considered to be operating normally and the fault is attributed to the failure of some nodes in the node plane; each disconnected node then performs steps 301 to 304 so as to restore network services as far as possible.

Case 2: Full disconnection. Go to step 402.
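The check in step 401 reduces to a set comparison between the disconnected nodes and all nodes in the domain. A minimal Python sketch follows; the function name `classify_range`, the string labels, and the `NEXT_STEP` table are assumptions introduced for illustration.

```python
def classify_range(lost_nodes, all_nodes):
    """Return "full" if every node in the domain is disconnected, else "partial"."""
    lost = set(lost_nodes)
    if not lost:
        raise ValueError("step 401 is only reached when at least one node is disconnected")
    return "full" if set(all_nodes) <= lost else "partial"


# Step numbers from the flow: partial disconnection continues at 301, full at 402.
NEXT_STEP = {"partial": 301, "full": 402}

result = classify_range({5, 6}, range(1, 7))
print(result, NEXT_STEP[result])  # partial 301
```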

402: When damage to the controller has disconnected the entire domain, the conventional controller recovery method is started. This method is triggered and executed by the control plane or the network management system, and starts a computer that serves as the standby controller.

After step 303, 304, or 402, the network is restored.

The exception in step 402 is the case where the controller is not damaged: the disconnection of all nodes then indicates that every node is damaged, or that the network's underlying hardware is damaged so that the whole network is out of control, and the network cannot be restored algorithmically.

Claims

1. An SDN network node fault location method, characterized by comprising the following steps:

An SDN network node determines whether it is disconnected by means of the keep-alive messages between the node and the controller; when the keep-alive messages disappear, it is determined that a disconnection fault has occurred;

The disconnected node sends an inquiry message to a neighbor node;

If the state of the neighbor node is normal, it is determined that there is a network fault between the disconnected node and the controller, or that the disconnected node itself has failed;

If the neighbor node is also disconnected, the neighbor node continues to send inquiry messages to other nodes until the disconnection range is determined;

The disconnection range falls into two cases, partial disconnection and full disconnection; if the disconnection is judged to be partial, the disconnected node sends a replacement request to the controller through a normal neighbor node;

The disconnected node queries the state of its own existing services and reports the result to the controller through the neighbor node;

For each disconnected node,

Case 1: The existing service state is normal; when a new service that would pass through the node arrives, the controller triggers the node replacement algorithm and issues a replacement instruction to the replacement node;

Case 2: The existing service state is abnormal; the controller starts the node replacement algorithm and issues a replacement instruction to the replacement node.

2. The SDN network node fault location method according to claim 1, characterized by further comprising the following step:

If the service state of the disconnected node is abnormal, the controller additionally starts a service migration algorithm to migrate the existing services of the disconnected node.

3. The SDN network node fault location method according to claim 1 or 2, characterized by further comprising the following steps:

When the disconnection range is determined and all nodes in the network are found to be disconnected from the controller, the controller is judged to be damaged and a conventional controller recovery method is started;

The conventional controller recovery method is triggered and executed by the control plane or the network management system, and starts a computer that serves as the standby controller.