
CN109362113B - Underwater acoustic sensor network cooperation exploration reinforcement learning routing method - Google Patents


Info

Publication number
CN109362113B
CN109362113B (application CN201811310120.6A)
Authority
CN
China
Prior art keywords
node
value
packet
data packet
data
Prior art date
Legal status
Active
Application number
CN201811310120.6A
Other languages
Chinese (zh)
Other versions
CN109362113A (en)
Inventor
冯晓宁
宋雪
王卓
Current Assignee
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN201811310120.6A
Publication of CN109362113A
Application granted
Publication of CN109362113B
Legal status: Active
Anticipated expiration

Classifications

    • H04W40/02 Communication route or path selection, e.g. power-based or shortest path routing
    • H04W40/22 Communication route or path selection using selective relaying for reaching a BTS [Base Transceiver Station] or an access point
    • H04W52/0203 Power saving arrangements in the radio access network or backbone network of wireless communication networks
    • H04W84/18 Self-organising networks, e.g. ad-hoc networks or sensor networks
    • Y02D30/70 Reducing energy consumption in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to the technical fields of underwater acoustic sensor networks and underwater acoustic routing protocols, and in particular to a cooperative-exploration reinforcement learning routing method for underwater acoustic sensor networks. The method comprises the following steps: (1) initialize the Q value and V value of each node; (2) judge whether |V_s^{t+1} - V_s^t| < ε holds; (3) a relay node receives a data/control packet, updates its neighbor list, and decides whether to keep forwarding; (4) the sink receives the data packet, ending the transmission. A routing protocol based on reinforcement learning can approximate the global optimum when selecting a path and can combine multiple factors that affect performance. In the present invention, while the algorithm has not converged, the source node sends several control packets together with each data packet to accelerate convergence; otherwise it sends only data packets. After the algorithm converges, an approximately globally optimal path is obtained by selecting the next-hop node with the highest V value, which balances network energy consumption, prolongs network lifetime, and solves the problem of slow reinforcement learning convergence.

Description

An underwater acoustic sensor network cooperative exploration reinforcement learning routing method

Technical field

The invention relates to the technical fields of underwater acoustic sensor networks and underwater acoustic routing protocols, and in particular to a cooperative-exploration reinforcement learning routing method for underwater acoustic sensor networks.

Background

Underwater Acoustic Sensor Networks (UASNs) consist of sensor nodes deployed underwater and sink nodes that collect the sensed data. Such networks support many applications, including environmental monitoring, tactical surveillance, resource exploration, navigation assistance, and disaster prevention. Because radio waves suffer high transmission loss under water, underwater communication usually relies on acoustic waves. At the same time, UASNs face unique challenges such as limited battery capacity, high bit error rates, high end-to-end latency, and limited available bandwidth.

Owing to the high latency, high energy consumption, and low bandwidth inherent to UASNs, their topology is usually distributed. A major problem for routing protocols in such networks is finding efficient, energy-saving paths. Reinforcement learning algorithms, which interact with the environment by trial and error to maximize the expected reward, have been applied to UASNs: with a reinforcement-learning-based routing protocol, each node can approximate the global optimum when choosing a path without knowing the topology of the whole network. Reinforcement learning lets nodes learn and adapt to the dynamic environment they operate in, and can combine multiple factors that affect routing performance, making routing decisions more comprehensive. In the present invention, the convergence speed of reinforcement learning is characterized by the convergence speed of the source node's V value.

In UASNs, as the network scale grows, reinforcement learning converges more slowly and network energy consumption rises; moreover, when the network topology changes, the protocol cannot track the change well, which degrades network performance.

Summary of the invention

The purpose of the present invention is to overcome the above deficiencies of the prior art by proposing a cooperative-exploration reinforcement learning routing method for underwater acoustic sensor networks. While the algorithm has not converged, the source node sends several control packets together with each data packet to explore paths cooperatively, accelerating the convergence of its V value. This solves the problem of slow reinforcement learning convergence while reducing network energy consumption and prolonging network lifetime.

The present invention can be realized by the following technical solution:

A cooperative-exploration reinforcement learning routing method for underwater acoustic sensor networks, comprising the following steps:

(1) Initialize the Q value and V value of each node.

(2) Determine the V value V_s^{t+1} of the source node s at the next moment.

(3) According to the Q and V values of the nodes, judge whether |V_s^{t+1} - V_s^t| < ε holds:

(3.1) If it holds, the source node sends only the data packet.

(3.2) If it does not hold, the source node sends control packets together with the data packet.

(4) On receiving a data or control packet from the source node, a relay node reads the packet header.

(5) The relay node updates its routing table from the received data and checks whether the packet is addressed to it. If so, it computes the Q value, writes the updated V value into the packet header, and continues forwarding the packet.

(6) Judge whether the sink node has received the data packet:

(6.1) If the sink has received the data packet, the transmission ends.

(6.2) If not, repeat steps (2) to (6) until the sink receives the data packet.
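The overall loop of steps (2) to (6) can be sketched in Python. This is an illustrative sketch only: `send_round` is a hypothetical callback standing in for one round of V-value update, packet dispatch, and relay forwarding, and is not part of the patent itself.

```python
def transmit(send_round, max_rounds=100):
    """Repeat steps (2)-(5) until the sink reports receipt of the data packet.

    send_round: callable performing one transmission round; it returns True
    once the sink has received the data packet (step 6.1).
    """
    for n in range(1, max_rounds + 1):
        if send_round():
            return n  # number of rounds until the sink received the packet
    raise RuntimeError("sink never received the data packet")
```

For example, with a stub that succeeds on the third round, `transmit` returns 3.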

Step (1) comprises the following sub-steps:

(1.1) Determine the reward function.

(1.2) From the reward function, determine the Q-value iteration function of each node.

The reward function R_nm of step (1.1) is the immediate reward obtained after the first node n finishes transmitting a data or control packet to the second node m, and is computed as:

R_nm = -g - α1·c + α2·d

where g is the fixed loss of a node when transmitting data, c is the residual-energy consumption function of the node, d describes the energy distribution of the node, and α1 and α2 are the weighting parameters of c and d, respectively.
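A minimal sketch of this reward, assuming all terms are supplied as plain numbers (the patent does not fix concrete forms for g, c, or d):

```python
def reward(g, c, d, alpha1, alpha2):
    """Immediate reward R_nm = -g - alpha1*c + alpha2*d for one transmission.

    g: fixed transmission loss, c: residual-energy consumption term,
    d: node energy-distribution term, alpha1/alpha2: weighting parameters.
    """
    return -g - alpha1 * c + alpha2 * d
```

A higher residual-energy consumption c lowers the reward, while a favourable energy distribution d raises it, steering traffic away from depleted nodes.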

The Q-value iteration function of step (1.2) is computed as:

Q_nm^{t+1} = (1 - α)·Q_nm^t + α·(R_nm + γ·max_k Q_mk^t)

where Q_nm^{t+1} is the Q value of the first node n at time t+1, α is the update rate of the Q value, γ is the discount factor, and Q_mk^t is the Q value of the second node m toward its neighbor k at time t.
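The iteration can be sketched as follows; it is assumed, for illustration, that node n knows the Q values of node m toward m's own neighbors (e.g. summarized by the V value carried in the packet header):

```python
def q_update(q_nm, r_nm, q_m, alpha, gamma):
    """One Q-learning step for the link n -> m:
    Q_nm(t+1) = (1 - alpha)*Q_nm(t) + alpha*(R_nm + gamma*max_k Q_mk(t)).

    q_m: Q values of node m toward each of its neighbors k.
    """
    return (1 - alpha) * q_nm + alpha * (r_nm + gamma * max(q_m))
```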

In the judgment condition |V_s^{t+1} - V_s^t| < ε of step (3), V_s^t denotes the V value of the source node at time t, V_s^{t+1} denotes the V value of the source node at time t+1, and ε is a small positive constant.

If the judgment of step (3.1) holds, the source node stops transmitting control packets and, combining its routing table with the Q-value iteration formula, computes the optimal path and transmits data packets upward until they reach the sink.

The function that computes the V value of the source node at the next moment is:

V_s^{t+1} = (1 - α)·V_s^t + α·ω·Σ_j (R_sj + γ·V_j^t)

where α has the same value as in the Q-value iteration function of step (1.2) and here denotes the learning rate, i.e. the update rate of the V value; ω is a normalization parameter over the paths probed by the control packets; and each term R_sj + γ·V_j^t is the experience obtained by one data or control packet exploring a path through next-hop node j.
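A sketch of this V update, under the assumption that ω simply averages over the packets sent in parallel (the patent states only that ω normalizes over the probed paths):

```python
def v_update(v_s, experiences, alpha):
    """Cooperative-exploration V update for the source node:
    V_s(t+1) = (1 - alpha)*V_s(t) + alpha*omega*sum(experiences).

    experiences: one value R_sj + gamma*V_j per data/control packet path;
    omega = 1/len(experiences) is an assumed normalization choice.
    """
    omega = 1.0 / len(experiences)
    return (1 - alpha) * v_s + alpha * omega * sum(experiences)
```

Sending several control packets per round contributes several experience terms at once, which is what speeds up the convergence of V_s.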

If, as described in step (5), the packet is addressed to this node, the node computes Q values with the Q-value iteration function, selects the node achieving the maximum value Q_max as the next hop, updates its V value to Q_max, rewrites the node information into the packet header, and continues the transmission.

Compared with the prior art, the present invention has the following beneficial effects:

(1) The present invention provides a cooperative-exploration reinforcement learning routing algorithm for underwater acoustic sensor networks. While the algorithm has not converged, the source node sends data packets and control packets at the same time, which speeds up the convergence of the source node's V value.

(2) After the algorithm converges, the present invention obtains an approximately globally optimal path by selecting the next-hop node with the highest V value, thereby balancing network energy consumption and prolonging network lifetime.

Description of drawings

Figure 1 is a structural diagram of an underwater acoustic sensor network.

Figure 2 is a schematic diagram of the cooperative-exploration reinforcement learning routing method.

Figure 3 is a flow chart of the source node executing the cooperative-exploration reinforcement learning algorithm.

Figure 4 is a flow chart of routing and forwarding.

Detailed description of embodiments

The present invention is further described below with reference to the accompanying drawings.

Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The following detailed description of the embodiments provided in the accompanying drawings is therefore not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by those skilled in the art from these embodiments without creative work fall within the protection scope of the present invention.

The invention provides a cooperative-exploration reinforcement learning routing method for underwater acoustic sensor networks. A routing protocol based on reinforcement learning can approximate the global optimum when selecting a path and can combine multiple factors that affect performance. In the present invention, while the algorithm has not converged, the source node sends several control packets containing only header information together with each data packet to explore paths cooperatively and accelerate the convergence of the source node's V value; otherwise it sends only data packets. The invention solves the problem of slow reinforcement learning convergence while reducing network energy consumption and prolonging network lifetime. The method specifically comprises the following steps:

(1) Initialize the Q value and V value of each node.

(2) Determine the V value V_s^{t+1} of the source node s at the next moment.

(3) Judge whether |V_s^{t+1} - V_s^t| < ε holds, where V_s^t denotes the V value of the source node at time t and ε is a small positive constant. If it holds, the source node sends only the data packet; otherwise, the source node sends control packets together with the data packet.

(4) A relay node receives the data or control packet, updates its neighbor list, and decides whether to keep forwarding.

(5) The sink receives the data packet, ending the transmission.

In step (2), the V-value iteration function of the source node is:

V_s^{t+1} = (1 - α)·V_s^t + α·ω·Σ_j (R_sj + γ·V_j^t)

where α is the learning rate, i.e. the update rate of the V value, which controls how much of the difference between the previous V value and the new V value is taken into account; γ is the discount factor, i.e. how strongly experience influences the current V value; ω is a normalization parameter over the paths probed by the control packets; and each term R_sj + γ·V_j^t is the experience obtained by one data or control packet exploring a path through next-hop node j. In step (3), when |V_s^{t+1} - V_s^t| < ε, the source node sends only data packets; when |V_s^{t+1} - V_s^t| ≥ ε, the source node sends control packets together with the data packets.
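The dispatch decision of step (3) can be sketched as follows; the default of two control packets matches the embodiment described below but is otherwise a free parameter of this sketch:

```python
def packets_to_send(v_new, v_old, eps, n_control=2):
    """Return the packets the source node emits this round.

    Converged (|V(t+1) - V(t)| < eps): only the data packet.
    Not converged: the data packet plus n_control exploring control packets.
    """
    if abs(v_new - v_old) < eps:
        return ["data"]
    return ["data"] + ["control"] * n_control
```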

Figure 1 is the structural diagram of the underwater acoustic sensor network of an embodiment of the present invention, and Figure 2 is a schematic diagram of the cooperative-exploration reinforcement learning routing method of the embodiment. With reference to these figures, this embodiment discloses an implementation of the cooperative-exploration reinforcement learning routing protocol for underwater acoustic sensor networks, as shown in Figures 3 and 4, as follows:

(1) Initialize the Q value and V value of each node.

(2) Determine the reward function.

In this embodiment, the reward function R_nm is the immediate reward obtained after node n finishes transmitting a data or control packet to node m.

R_nm = -g - α1·c + α2·d

g is the fixed loss of a node when transmitting data, c is the residual-energy consumption function of the node, d describes the energy distribution of the node, and α1 and α2 are the weighting parameters of c and d, respectively.

(3) Determine the Q-value iteration function of each node.

Q_nm^{t+1} = (1 - α)·Q_nm^t + α·(R_nm + γ·max_k Q_mk^t)

where Q_nm^{t+1} is the Q value of node n at time t+1, α is the update rate of the Q value, γ is the discount factor, and Q_mk^t is the Q value of node m toward its neighbor k at time t.

(4) Determine the V-value calculation function of the source node.

V_s^{t+1} = (1 - α)·V_s^t + α·ω·Σ_j (R_sj + γ·V_j^t)

where V_s^{t+1} is the V value of the source node at the next moment, ω is a normalization parameter over the paths probed by the control packets, and each term R_sj + γ·V_j^t is the experience obtained by one data or control packet exploring a path.

(5) The source node performs cooperative exploration.

The structure of the underwater acoustic sensor network is shown in Figure 1. For simplicity, the network of this embodiment has a single source and a single sink. The source node collects data and transmits it upward through the underwater acoustic network along relay nodes, hop by hop, until it reaches the sink. The sink receives data from the relay nodes below the sea surface through an underwater acoustic receiver and forwards it to the base station by radio; the base station then performs subsequent analysis and processing.

As illustrated in Figure 2, when |V_s^{t+1} - V_s^t| ≥ ε, the source node sends data packets and control packets at the same time; in this embodiment, for ease of explanation, the number of control packets is set to two.

The source node computes Q values with the Q-value iteration function and updates its V value, then, based on the results, selects one node to receive the data packet and two nodes to receive control packets. In this embodiment, the source node selects its neighbor node 3 as the next hop for data-packet transmission, and nodes 1 and 5 as the next hops for control-packet transmission.

After overhearing a data or control packet, nodes 1, 3, and 5 read the packet header and add the previous-hop node to their neighbor lists. If the packet is addressed to the node, it computes Q values with the Q-value iteration function, selects the node achieving Q_max as the next hop, updates its V value to Q_max, rewrites the node information into the packet header, and continues the transmission.
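A sketch of this relay behaviour, with hypothetical dictionaries standing in for the node's Q table, the neighbours' advertised V values, and per-link rewards; header fields are modelled as plain dict keys:

```python
def relay_forward(packet, node_id, q_table, neighbor_v, rewards, alpha, gamma):
    """Handle an overheard packet at a relay node.

    Returns None if the packet is not addressed to this node (it is merely
    overheard; the neighbor list would still be updated). Otherwise performs
    one Q update toward every candidate next hop, using gamma*V_m as a
    stand-in for gamma*max_k Q_mk (each node advertises V = Qmax), picks the
    best hop greedily, sets this node's V to Qmax, and rewrites the header.
    """
    if packet["next_hop"] != node_id:
        return None
    for m in q_table:
        q_table[m] = (1 - alpha) * q_table[m] + alpha * (
            rewards[m] + gamma * neighbor_v[m]
        )
    best = max(q_table, key=q_table.get)
    header = dict(packet, prev_hop=node_id, next_hop=best, v=q_table[best])
    return best, header
```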

The neighbor nodes of nodes 1, 3, and 5 repeat the above actions until the data or control packet reaches the sink.

(6) When |V_s^{t+1} - V_s^t| < ε, the source node stops sending control packets.

Once the source node determines that |V_s^{t+1} - V_s^t| < ε holds, it terminates the transmission of control packets and, combining its routing table with the Q-value iteration formula, computes the optimal path and transmits data packets upward until they reach the sink.

Claims (1)

1. An underwater acoustic sensor network cooperative-exploration reinforcement learning routing method, characterized by comprising the following steps:
Step 1: initialize the Q values and V values of all nodes;
Step 2: determine the V value of the source node s at the next moment:
V_s^{t+1} = (1 - α)·V_s^t + α·ω·Σ_j (R_sj + γ·V_j^t)
wherein V_s^t denotes the V value of the source node s at time t; α is the update rate; γ is the discount factor; ω is a normalization parameter over the paths probed by the control packets; the reward function R_sj is the immediate reward obtained after the transmission of a data/control packet from the source node s to node j has been completed, R_sj = -g - α1·c + α2·d, where g is the fixed loss of the node in data transmission, c is the residual-energy consumption function of the node, d is the node energy distribution, and α1 and α2 are the weighting parameters of c and d, respectively;
Step 3: if |V_s^{t+1} - V_s^t| ≥ ε, the source node sends the data packet and the control packets at the same time; if |V_s^{t+1} - V_s^t| < ε, the source node sends only the data packet; ε denotes a small positive constant;
Step 4: after receiving a data/control packet, a relay node reads the packet header, adds the previous-hop node to its neighbor list, and judges whether the packet is addressed to it; if so, it computes the Q value
Q_nm^{t+1} = (1 - α)·Q_nm^t + α·(R_nm + γ·max_k Q_mk^t)
wherein Q_nm^t denotes the Q value of node n at time t, and the reward function R_nm is the immediate reward obtained after the first node n has finished transmitting a data/control packet to the second node m, R_nm = -g - α1·c + α2·d; the node achieving the maximum value Q_max is selected as the next hop, the V value is updated to Q_max, and the node information is rewritten into the packet header for continued transmission;
Step 5: if the sink node receives the data packet, the transmission ends; otherwise, steps 2 to 4 are repeated until the sink receives the data packet.
CN201811310120.6A 2018-11-06 2018-11-06 Underwater acoustic sensor network cooperation exploration reinforcement learning routing method Active CN109362113B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811310120.6A CN109362113B (en) 2018-11-06 2018-11-06 Underwater acoustic sensor network cooperation exploration reinforcement learning routing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811310120.6A CN109362113B (en) 2018-11-06 2018-11-06 Underwater acoustic sensor network cooperation exploration reinforcement learning routing method

Publications (2)

Publication Number Publication Date
CN109362113A (en) 2019-02-19
CN109362113B (en) 2022-03-18

Family

ID=65344072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811310120.6A Active CN109362113B (en) 2018-11-06 2018-11-06 Underwater acoustic sensor network cooperation exploration reinforcement learning routing method

Country Status (1)

Country Link
CN (1) CN109362113B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110719617B (en) * 2019-09-30 2023-02-03 西安邮电大学 Q Routing Method Based on Arctangent Learning Rate Factor
CN110868727A (en) * 2019-10-28 2020-03-06 辽宁大学 Optimization method of data transmission delay in wireless sensor network
CN111629440A (en) * 2020-05-19 2020-09-04 哈尔滨工程大学 A Convergence Judgment Method of MAC Protocol Using Q-learning
CN112351400B (en) * 2020-10-15 2022-03-11 天津大学 A Routing Policy Generation Method for Underwater Multimodal Networks Based on Improved Reinforcement Learning
CN112469103B (en) * 2020-11-26 2022-03-08 厦门大学 Underwater sound cooperative communication routing method based on reinforcement learning Sarsa algorithm
CN112867089B (en) * 2020-12-31 2022-04-05 厦门大学 Underwater sound network routing method based on information importance and Q learning algorithm
CN112954769B (en) * 2021-01-25 2022-06-21 哈尔滨工程大学 Underwater wireless sensor network routing method based on reinforcement learning
CN113141592B (en) * 2021-04-11 2022-08-19 西北工业大学 Long-life-cycle underwater acoustic sensor network self-adaptive multi-path routing method
CN113783782B (en) * 2021-09-09 2023-05-30 哈尔滨工程大学 Opportunity routing candidate set node ordering method for deep reinforcement learning
CN114828141B (en) * 2022-04-25 2024-04-19 广西财经学院 A multi-hop routing method for UWSNs based on AUV networking
CN114786236B (en) * 2022-04-27 2024-05-31 曲阜师范大学 Method and device for heuristic learning of routing protocol by wireless sensor network
CN115175268B (en) * 2022-07-01 2023-07-25 重庆邮电大学 Heterogeneous network energy-saving routing method based on deep reinforcement learning
CN115987886B (en) * 2022-12-22 2024-06-04 厦门大学 A Q-learning routing method for underwater acoustic networks based on meta-learning parameter optimization
CN115843083B (en) * 2023-02-24 2023-05-12 青岛科技大学 Underwater wireless sensor network routing method based on multi-agent reinforcement learning
CN118611781B (en) * 2024-08-08 2024-10-18 中山大学 Underwater network data communication method and system based on reinforcement learning and power control
CN118869095B (en) * 2024-08-15 2025-04-29 中国科学院声学研究所 Cross-layer routing protocol method for underwater acoustic communication network based on channel quality

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106358308A (en) * 2015-07-14 2017-01-25 北京化工大学 Resource allocation method for reinforcement learning in ultra-dense network
CN107809781A (en) * 2017-11-02 2018-03-16 中国科学院声学研究所 A kind of loop free route selection method of load balancing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ITUB20155144A1 (en) * 2015-10-16 2017-04-16 Univ Degli Studi Di Roma La Sapienza Roma ? METHOD OF ADAPTING AND JOINING THE JOURNEY POLICY AND A RETRANSMISSION POLICY OF A KNOT IN A SUBMARINE NETWORK, AND THE MEANS OF ITS IMPLEMENTATION?

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106358308A (en) * 2015-07-14 2017-01-25 北京化工大学 Resource allocation method for reinforcement learning in ultra-dense network
CN107809781A (en) * 2017-11-02 2018-03-16 中国科学院声学研究所 A kind of loop free route selection method of load balancing

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"AUV-Aided Communication Method for Underwater Mobile Sensor Network";冯晓宁;《IEEE》;20160413;全文 *
"基于L-π演算的WSN路由协议形式化方法";冯晓宁;《吉林大学学报(工学版)》;20140527;全文 *
"基于反馈的合作强化学习水下路由算法";卜任菲;《通信技术》;20170810;全文 *
"多普勒辅助水下传感器网络时间同步机制研究";王卓;《通信学报》;20170125;全文 *
金志刚."基于指向性换能器水声传感器网络功率控制算法".《华中科技大学学报(自然科学版) 2017-07-14 》.2017,全文. *

Also Published As

Publication number Publication date
CN109362113A (en) 2019-02-19

Similar Documents

Publication Publication Date Title
CN109362113B (en) Underwater acoustic sensor network cooperation exploration reinforcement learning routing method
CN112821940B (en) Satellite network dynamic routing method based on inter-satellite link attribute
CN112867089B (en) Underwater sound network routing method based on information importance and Q learning algorithm
KR101022054B1 (en) Adaptive communication environment setting method and device for underwater sensor network
CN111278078B (en) A Realization Method of Adaptive Routing Protocol for Mobile Sparse Underwater Acoustic Sensor Network
CN108990129A (en) A kind of wireless sensor network cluster-dividing method and system
CN112188583B (en) Ocean underwater wireless sensing network opportunistic routing method based on reinforcement learning
CN109547351B (en) Routing method based on Q-learning and trust model in Ad Hoc network
CN113207156B (en) A wireless sensor network cluster routing method and system
CN102625404A (en) A Distributed Routing Protocol Method Applied to 3D Underwater Acoustic Sensor Network
CN103701567B (en) A kind of self-adaptive modulation method and system for wireless in-ground sensor network
CN103200643A (en) Distributed fault-tolerant topology control method based on dump energy sensing
CN108112050A (en) Energy balance and deep-controlled Routing Protocol based on underwater wireless sensing network
Zou et al. A cluster-based adaptive routing algorithm for underwater acoustic sensor networks
CN106879044B (en) A hole-aware routing method for underwater sensor networks
CN108650030B (en) Water surface multi-sink node deployment method of underwater wireless sensor network
Rahman et al. Routing protocols for underwater ad hoc networks
CN114531716B (en) A routing selection method based on energy consumption and link quality
CN106879042B (en) A kind of underwater wireless sensor network shortest-path rout ing algorithms
Natarajan et al. Adaptive Time Difference of Time of Arrival in Wireless Sensor Network Routing for Enhancing Quality of Service.
CN103607747A (en) Inter-cluster virtual backbone route protocol method based on power control
Diamant et al. Routing in multi-modal underwater networks: A throughput-optimal approach
CN111901237B (en) Source routing method and system, related device and computer readable storage medium
Saravanan et al. Towards an adaptive routing protocol for low power and lossy networks (RPL) for reliable and energy efficient communication in the Internet of Underwater Things (iout)
KR101654734B1 (en) Method for modelling information transmission network having hierarchy structure and apparatus thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant