CN120595851A - UAV flight attitude adjustment method and system based on reinforcement learning - Google Patents
- Publication number
- CN120595851A (application CN202510835183.7A)
- Authority
- CN
- China
- Prior art keywords
- state
- information
- flight
- meta
- uav
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/40—Control within particular dimensions
- G05D1/49—Control of attitude, i.e. control of roll, pitch or yaw
- G05D1/495—Control of attitude, i.e. control of roll, pitch or yaw to ensure stability
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/0265—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion
- G05B13/027—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion using neural networks only
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/40—Control within particular dimensions
- G05D1/46—Control of position or course in three dimensions
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D2101/00—Details of software or hardware architectures used for the control of position
- G05D2101/10—Details of software or hardware architectures used for the control of position using artificial intelligence [AI] techniques
- G05D2101/15—Details of software or hardware architectures used for the control of position using artificial intelligence [AI] techniques using machine learning, e.g. neural networks
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D2109/00—Types of controlled vehicles
- G05D2109/20—Aircraft, e.g. drones
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Automation & Control Theory (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- Medical Informatics (AREA)
- Health & Medical Sciences (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
Abstract
The invention relates to the technical field of unmanned aerial vehicle (UAV) flight attitude control, and in particular to a reinforcement-learning-based UAV flight attitude adjustment method and system. The method comprises: constructing a dynamic environment grid map from a digital map and real-time semantic segmentation results, and generating an initial trajectory with a rapidly-exploring random tree (RRT) algorithm extended with spatio-temporal constraints; collecting UAV flight state information, environment perception information, and image sharpness indices, and feeding them into a spatio-temporal attention encoder to generate a semantics-fused state tensor; assigning importance to the state tensor via an information-entropy weighting mechanism to obtain a weighted state vector; inputting the weighted state vector into a Meta-SAC model with meta-learning capability, which outputs the target flight reference pose and dynamic gain coefficients for an LQR controller; having the control module drive the LQR controller to generate flight control commands from this information; and constructing a reward function based on image quality and energy efficiency that feeds back to the reinforcement learning policy in real time.
Description
Technical Field
The invention relates to the technical field of unmanned aerial vehicle flight attitude control, in particular to an unmanned aerial vehicle flight attitude adjustment method and system based on reinforcement learning.
Background
Unmanned aerial vehicles (UAVs) have found wide application in fields such as power-line inspection, geological survey, and security monitoring; in power-line inspection in particular, a UAV can replace manual labor in high-risk, complex high-altitude inspection tasks. However, conventional UAVs rely on manual remote control or preset routes to execute tasks; they suffer from poor autonomy, weak environmental adaptability, and insufficient flight-control precision, and struggle to meet the requirements of high-quality image acquisition and efficient operation in complex environments.
Most current flight control systems are based on PID (proportional-integral-derivative) control or fixed route planning. These methods depend on static rules, lack adaptive capability, and have difficulty coping with complex factors such as dynamic obstacles and wind-field disturbance. In multi-objective tasks, existing control methods cannot handle multiple indices cooperatively, leading to poor control performance and problems such as image blur, target loss, and route deviation. In addition, systems that partially integrate path planning and control optimization adopt a staged processing strategy and lack an end-to-end feedback mechanism; they cannot continuously optimize the control strategy during execution, which limits the global performance of the system. Especially in high-precision aerial photography or inspection of complex structures, where coordinated control of flight attitude and camera viewing angle is demanding, traditional controllers struggle to balance precision and real-time performance.
Therefore, a flight control method with adaptive learning capability and dynamic feedback optimization is needed to improve the intelligent level and operation performance of the unmanned aerial vehicle in the real inspection task.
Disclosure of Invention
The invention provides a reinforcement-learning-based UAV flight attitude adjustment method and system, aiming to solve the problems of poor adaptability to dynamic environments, inaccurate attitude adjustment, and unstable image acquisition quality in traditional UAV flight control systems.
In order to achieve the above purpose, the present invention provides the following technical solutions:
The reinforcement-learning-based UAV flight attitude adjustment method comprises the following steps:
Constructing a dynamic environment grid from the digital map and real-time semantic segmentation results, and generating an initial trajectory with a rapidly-exploring random tree (RRT) algorithm extended with spatio-temporal constraints;
Acquiring the UAV's flight state information on the initial trajectory, environment perception information from the onboard sensors, and image sharpness information, and inputting them into a spatio-temporal attention encoder to generate a state tensor fusing environmental semantics;
Applying an information-entropy weighting mechanism to the state tensor to assign state importance, obtaining a weighted state vector;
Inputting the weighted state vector into a Meta-SAC model with meta-learning capability, whose Actor network outputs target flight reference pose parameters and gain coefficients for dynamically adjusting the LQR controller;
Based on the target flight reference pose parameters and the LQR gain coefficients, driving the LQR controller to generate low-level flight control commands that make the UAV execute the corresponding attitude actions;
During execution of the flight control commands, constructing a fuzzy-logic reward function based on energy efficiency and inspection-image sharpness, and using the reward value as feedback to update the Meta-SAC model's policy.
Further, the step of constructing the dynamic environment grid is as follows:
Identifying and classifying obstacles in the environment by using a semantic segmentation model based on a digital map of a target inspection area and an image acquired by an airborne sensor;
and fusing the static geographic information and the dynamic obstacle identification information to construct a three-dimensional dynamic environment raster image containing time dimension, wherein the raster records the trafficability, the obstacle type and the dynamic behavior attribute of each area in the environment.
Further, the step of generating the initial track is as follows:
In the dynamic environment grid, taking the current position of the UAV as the root node and the target point as the goal, performing path expansion with a rapidly-exploring random tree algorithm;
And introducing Lyapunov stability constraint in path expansion, performing track fairing processing through a B spline curve, and outputting an initial flight track sequence containing time and attitude information.
Further, the step of generating a state tensor fusing the environment semantics comprises the following steps:
Carrying out normalization preprocessing on flight state information, environment perception information and image definition information of the unmanned aerial vehicle;
inputting the preprocessed information into a space-time attention encoder, and fusing multi-modal data by utilizing a multi-head attention mechanism;
and extracting the correlation characteristics of the spatial context and the time sequence by combining the semantic segmentation result, and constructing a state tensor comprising position information, obstacle threat degree, image definition predicted value and energy consumption estimation.
Further, the state importance allocation step comprises the following steps:
measuring the change degree of each sub-state component in the state tensor based on the information entropy theory;
Converting the information entropy value into a weight factor for representing the importance degree of each state dimension in policy decision;
And carrying out weighted fusion on the state tensors according to the weight factors to generate weighted state vectors with state importance perception capability.
Further, the step of generating the low-layer flight control instruction by the LQR controller is as follows:
Receiving target flight reference pose parameters output by a Meta-SAC model as expected input;
constructing a system state error vector by combining the current flight state of the unmanned aerial vehicle, and calculating a control instruction according to the dynamically adjusted feedback gain coefficient;
Minimizing a cost function of state error and control energy by solving a linear quadratic optimization problem, and outputting low-level control quantities to drive the aircraft to perform the corresponding attitude adjustment.
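The LQR step above can be sketched in a few lines. The following Python fragment is a minimal illustration, not the patent's implementation: the toy single-axis attitude model, the cost matrices `Q`/`R`, and the fixed-point Riccati iteration are all assumptions. It computes a feedback gain K and the control command u = −K(x − x_ref):

```python
import numpy as np

def dlqr(A, B, Q, R, iters=500, tol=1e-10):
    """Iterate the discrete-time Riccati recursion until the cost-to-go
    matrix P converges, then return the gain K of u = -K (x - x_ref)."""
    P = Q.copy()
    K = np.zeros((B.shape[1], A.shape[0]))
    for _ in range(iters):
        BtP = B.T @ P
        K = np.linalg.solve(R + BtP @ B, BtP @ A)
        P_next = Q + A.T @ P @ (A - B @ K)
        if np.max(np.abs(P_next - P)) < tol:
            P = P_next
            break
        P = P_next
    return K

# Toy single-axis attitude model: state = [angle, angular rate], input = torque.
dt = 0.02
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.diag([10.0, 1.0])   # state-error penalty (the policy could retune this)
R = np.array([[0.1]])      # control-effort penalty

K = dlqr(A, B, Q, R)
x = np.array([0.3, 0.0])      # current attitude state
x_ref = np.array([0.0, 0.0])  # reference pose from the policy network
u = -K @ (x - x_ref)          # low-level control command
```

In the patent's scheme the gain would additionally be modulated by the coefficients output by the Meta-SAC Actor network, e.g. by scaling `Q` and `R` before solving.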
Further, the step of constructing the fuzzy logic rewarding function is as follows:
Acquiring image definition indexes, energy consumption rate and relative distance between obstacles of the unmanned aerial vehicle in the flight process, and carrying out normalization processing on the continuous variables;
Converting the normalized input variable into fuzzy variables, and mapping the fuzzy variables into preset fuzzy membership functions to obtain corresponding fuzzy language values;
fuzzy reasoning is carried out based on a preset fuzzy rule base, and the reasoning rule comprises combination judgment of image definition, energy consumption and obstacle distance;
the reasoning result is converted into specific numerical rewards through a defuzzification method and is used as an instant feedback signal of the Meta-SAC model to update the strategy network.
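The fuzzification, rule inference, and defuzzification steps above can be sketched as follows. This is a hedged toy example: the triangular membership functions, the three-rule base, and the singleton-output defuzzification are illustrative assumptions, not the patent's actual rule base.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with peak at b."""
    return np.maximum(np.minimum((x - a) / (b - a + 1e-9),
                                 (c - x) / (c - b + 1e-9)), 0.0)

def fuzzy_reward(sharpness, energy_rate, obstacle_dist):
    """All inputs normalized to [0, 1]. Returns a scalar reward via
    min-inference over three illustrative rules and centroid
    defuzzification of weighted singleton outputs."""
    # Fuzzification: map crisp inputs to linguistic values.
    sharp_hi = tri(sharpness, 0.4, 1.0, 1.6)
    sharp_lo = tri(sharpness, -0.6, 0.0, 0.6)
    energy_lo = tri(energy_rate, -0.6, 0.0, 0.6)
    dist_safe = tri(obstacle_dist, 0.3, 1.0, 1.7)
    dist_near = tri(obstacle_dist, -0.7, 0.0, 0.7)

    # Illustrative rule base: (firing strength, reward level).
    rules = [
        (min(sharp_hi, energy_lo, dist_safe), 1.0),  # ideal flight: big reward
        (min(sharp_lo, dist_safe), 0.3),             # blurry image: small reward
        (dist_near, -1.0),                           # too close to obstacle: penalty
    ]
    num = sum(w * r for w, r in rules)
    den = sum(w for w, _ in rules) + 1e-9
    return num / den  # defuzzified crisp reward
```

A sharp image at low energy use far from obstacles then scores near 1, while closing in on an obstacle drives the reward negative, which is the feedback shape the Meta-SAC update consumes.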
The invention also provides an unmanned aerial vehicle flight attitude adjusting system based on reinforcement learning, which comprises:
The route planning module is used for constructing a dynamic environment grid from the digital map and real-time semantic segmentation results, and generating an initial route with a rapidly-exploring random tree (RRT) algorithm extended with spatio-temporal constraints;
The information fusion module is used for acquiring flight state information of the unmanned aerial vehicle on an initial track, environment perception information of an onboard sensor and image definition information, inputting the information into the space-time attention encoder and generating a state tensor fusing environment semantics;
the tensor weighting module is used for introducing an information entropy weighting mechanism to the state tensor to distribute the state importance so as to obtain a weighted state vector;
The pose generation module is used for inputting the weighted state vector into a Meta-SAC model with Meta-learning capability, and generating output target flight reference pose parameters and gain coefficients for dynamically adjusting the LQR controller through an Actor network;
The pose control module is used for driving the LQR controller to generate a low-layer flight control instruction so as to control the unmanned aerial vehicle to execute corresponding pose actions based on the target flight reference pose parameters and the LQR gain coefficients;
And the feedback updating module is used for constructing a reward function based on the energy consumption efficiency and the definition of the patrol image in the process of executing the flight control instruction, and taking the reward function value as feedback of the Meta-SAC model to carry out strategy updating.
The beneficial effects of the invention are as follows:
1. A rapidly-exploring random tree algorithm with spatio-temporal constraints is introduced: by incorporating time windows and obstacle dynamic-behavior modeling into the RRT path-planning process, the responsiveness of the flight path to dynamic environment changes is markedly improved. Combined with the Lyapunov stability constraint and B-spline fairing, the generated trajectory not only satisfies dynamic reachability requirements but is also continuous and smooth, effectively improving flight safety and path controllability.
2. An information-entropy weighting mechanism is introduced for tensor weighting: the state tensor produced by multi-modal perception receives a dynamic importance distribution, strengthening attention on key environmental variables (such as blurred image regions or high-threat obstacles). This mechanism improves the representational power of the state input and the distinguishability of training samples, thereby raising the learning efficiency and attitude-decision accuracy of the Meta-SAC policy model.
3. A fuzzy-logic reward function is constructed: taking performance indices such as image sharpness, energy efficiency, and flight stability as inputs, a fuzzy-logic reward with multi-rule fusion provides multi-objective guidance for the flight policy. Compared with a traditional single-index feedback mechanism, it can adaptively adjust the reward intensity, strengthening the robustness and generalization of the policy model in complex inspection tasks.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of an unmanned aerial vehicle flight attitude adjustment method based on reinforcement learning provided by the invention;
Fig. 2 is a schematic structural diagram of an unmanned aerial vehicle flight attitude adjustment system based on reinforcement learning.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
Example 1
The unmanned aerial vehicle flight attitude adjustment method based on reinforcement learning, as shown in fig. 1, comprises the following steps:
S100, constructing a dynamic environment grid from the digital map and real-time semantic segmentation results, and generating an initial trajectory with a rapidly-exploring random tree (RRT) algorithm extended with spatio-temporal constraints;
further, the step of constructing the dynamic environment grid is as follows:
Identifying and classifying obstacles in the environment by using a semantic segmentation model based on a digital map of a target inspection area and an image acquired by an airborne sensor;
and fusing the static geographic information and the dynamic obstacle identification information to construct a three-dimensional dynamic environment raster image containing time dimension, wherein the raster records the trafficability, the obstacle type and the dynamic behavior attribute of each area in the environment.
Specifically, based on a preset inspection task area, two-dimensional or three-dimensional digital map data containing information such as topography, height, buildings and the like is loaded. The map data may originate from satellite imagery, a geographic information system, or a high-precision map.
Then, environmental images acquired in real time by sensors such as the UAV's onboard camera or lidar are fed into a trained semantic segmentation network (e.g., DeepLabV3+ or HRNet), which performs pixel-level recognition and semantic classification of obstacles in the images, such as dynamic or static obstacles like trees, utility poles, people, and vehicles, and outputs the corresponding semantic label maps.
On this basis, the static geographic information already in the digital map (roads, buildings, bridges, etc.) is spatially fused with the dynamic obstacle information from the semantic segmentation result to generate a three-dimensional dynamic environment grid carrying spatial coordinates and time labels. The grid is built with a fixed-resolution rasterization method (e.g., an octree or voxel grid); each cell records fields for trafficability (passable / impassable / risk area), obstacle type (e.g., static building, moving object), and dynamic behavior (e.g., presence in the current frame, predicted direction of motion at the next moment).
The construction of the dynamic environment grid can realize unified modeling of static geographic information and dynamic barrier information in the inspection area, so that the space-time resolution of environment expression is improved, and a real-time and quantifiable environment constraint basis is provided for track planning and posture adjustment. By integrating semantic segmentation results and time dimension information, the system can accurately judge the trafficability of each area and the dynamic behavior of the obstacle, and effectively improve the obstacle avoidance capability and safety of path planning.
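The per-cell record described above can be sketched as a sparse voxel grid. This is a minimal illustration under assumptions of my own (the `Cell` fields, the dict-backed sparse storage, and the 0.5 m resolution are not specified by the patent):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Cell:
    passable: bool = True               # passable / blocked / risk area
    obstacle_type: str = "none"         # e.g. "building", "tree", "vehicle"
    dynamic: bool = False               # does the occupant move between frames?
    velocity: tuple = (0.0, 0.0, 0.0)   # predicted motion for dynamic obstacles

class DynamicGrid:
    """Fixed-resolution 3-D grid indexed by (x, y, z, t). Static map layers
    are written once; semantic-segmentation detections update the dynamic
    layer every frame."""
    def __init__(self, resolution=1.0):
        self.res = resolution
        self.cells = {}   # sparse storage: (ix, iy, iz, t) -> Cell

    def key(self, pos, t):
        return tuple(int(np.floor(p / self.res)) for p in pos) + (t,)

    def mark(self, pos, t, **attrs):
        self.cells[self.key(pos, t)] = Cell(**attrs)

    def query(self, pos, t):
        return self.cells.get(self.key(pos, t), Cell())  # default: free space

grid = DynamicGrid(resolution=0.5)
grid.mark((3.2, 1.1, 10.0), t=0, passable=False, obstacle_type="tree")
grid.mark((5.0, 2.0, 8.0), t=1, passable=False, obstacle_type="vehicle",
          dynamic=True, velocity=(1.0, 0.0, 0.0))
```

The time index `t` in the key is what lets the planner ask "is this cell free at the moment the UAV would arrive", which is the spatio-temporal constraint the RRT extension relies on.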
Further, the step of generating the initial track is as follows:
In the dynamic environment grid, taking the current position of the unmanned aerial vehicle as a root node, taking a target point as an end point, and adopting a fast search random tree algorithm to perform path expansion;
And introducing Lyapunov stability constraint in path expansion, performing track fairing processing through a B spline curve, and outputting an initial flight track sequence containing time and attitude information.
Specifically, the three-dimensional dynamic environment grid constructed in step S100 serves as the path search space, with the UAV's current three-dimensional position as the root node of the rapidly-exploring random tree and the target inspection point as the goal. During RRT expansion, the algorithm incrementally connects feasible nodes randomly sampled from the search space into a tree structure, while using the time-dimension information of the environment grid to keep the path temporally consistent and avoid collisions with dynamic obstacles.
To improve the dynamic stability of the path and the feasibility of subsequent control, a Lyapunov function is introduced during path expansion as a stability constraint: at each expansion step, the dynamic stability of the candidate path segment is evaluated so as to screen out path nodes that would cause attitude divergence or loss of control, ensuring that the path remains globally asymptotically stable within the attitude-tracking and control-input envelope. The Lyapunov function can be expressed as:
$V(x) = (x - x_r)^\top P \,(x - x_r)$;
where $x$ denotes the current state vector of the UAV (containing state variables such as position, velocity, and attitude angles), $x_r$ denotes the reference state vector, and $P$ is a symmetric positive-definite matrix satisfying the Lyapunov stability condition, so that $V(x)$ is a positive-definite function of the state deviation. On this basis, only path segments satisfying the following condition are retained during expansion:
$\dot V(x) = (x - x_r)^\top (A^\top P + P A)(x - x_r) < 0$;
where $A$ is the linear approximate state matrix of the UAV dynamics at the current state (obtained by linearization or empirical modeling).
Then, B-spline interpolation is applied to the discrete path-node sequence produced by the search to smooth the trajectory; node timestamps and attitude parameters (heading, pitch, and roll angles) are embedded during interpolation to construct a complete initial trajectory sequence with time and attitude constraints.
By applying a rapidly-exploring random tree search with Lyapunov stability constraints inside the dynamic environment grid, the method achieves efficient three-dimensional trajectory search while guaranteeing the controllability and stability of the path in a dynamic-obstacle environment, avoiding paths that cannot be tracked or that risk attitude divergence. The subsequent B-spline fairing further improves the continuity and flyability of the trajectory and facilitates accurate tracking and attitude adjustment by the downstream controller, while the retained node time and attitude information provides a structured, high-quality initial flight reference for the reinforcement learning and control modules.
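Two ingredients of the planning step above, the Lyapunov screening test and the trajectory fairing, can be sketched in isolation. Assumptions of mine: the screening check uses the quadratic Lyapunov condition on linearized dynamics, and Chaikin corner cutting (whose limit curve is a quadratic B-spline) stands in for the full B-spline fairing.

```python
import numpy as np

def lyapunov_ok(x, x_ref, A, P):
    """Keep a candidate node only if V = e^T P e is decreasing along the
    linearized dynamics, i.e. e^T (A^T P + P A) e < 0."""
    e = x - x_ref
    return float(e @ (A.T @ P + P @ A) @ e) < 0.0

def chaikin_smooth(path, iterations=3):
    """Corner-cutting smoothing whose limit curve is a quadratic B-spline;
    a lightweight stand-in for the B-spline fairing step. Endpoints are
    preserved so the start and goal of the RRT path stay fixed."""
    pts = np.asarray(path, dtype=float)
    for _ in range(iterations):
        q = 0.75 * pts[:-1] + 0.25 * pts[1:]
        r = 0.25 * pts[:-1] + 0.75 * pts[1:]
        inner = np.stack([q, r], axis=1).reshape(-1, pts.shape[1])
        pts = np.vstack([pts[:1], inner, pts[-1:]])
    return pts
```

In a full planner, `lyapunov_ok` would be called on each candidate RRT extension before it is added to the tree, and `chaikin_smooth` on the final node sequence before timestamps and attitudes are attached.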
S200, acquiring flight state information of an unmanned aerial vehicle on an initial track, environment perception information of an onboard sensor and image definition information, and inputting the information to a space-time attention encoder to generate a state tensor fusing environment semantics;
Specifically, the image sharpness information is obtained by averaging target edge sharpness, image contrast, and image detail retention.
The image sharpness index jointly considers the target edge sharpness, the image contrast, and the image detail retention, combined with an equal-weight averaging strategy:
$S = \frac{S_e + S_c + S_d}{3}$;
where $S$ denotes the image sharpness information, $S_e$ the target edge sharpness, $S_c$ the image contrast, and $S_d$ the image detail retention. The target edge sharpness measures how crisp object boundaries are in the image; the sharper the image, the larger it is. In this embodiment, edge detection uses the Sobel operator, and the mean gray-gradient magnitude over the image is taken as the sharpness index:
$S_e = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \sqrt{G_x(i,j)^2 + G_y(i,j)^2}$;
where $M$ and $N$ are the numbers of rows and columns of image $I$, and $G_x$ and $G_y$ are the gradient images in the $x$ and $y$ directions. The image contrast reflects the difference between bright and dark regions and is an important index of visual image quality; it is computed as the standard deviation of the gray values:
$S_c = \sqrt{\frac{1}{MN} \sum_{i,j} \bigl(I(i,j) - \mu\bigr)^2}$;
where $I(i,j)$ is the gray value of a pixel and $\mu$ is the mean gray value. The image detail retention measures how well high-frequency information such as texture and contours is preserved. In this embodiment, the Laplace operator is applied as a second-order differentiation of the image to extract detail regions, and the mean absolute value over them is taken as the detail index:
$S_d = \frac{1}{MN} \sum_{i,j} \bigl| L(i,j) \bigr|$;
where $L(i,j)$ is the Laplacian response of the image at position $(i,j)$.
Further, the step of generating a state tensor fusing the environment semantics comprises the following steps:
Carrying out normalization preprocessing on flight state information, environment perception information and image definition information of the unmanned aerial vehicle;
inputting the preprocessed information into a space-time attention encoder, and fusing multi-modal data by utilizing a multi-head attention mechanism;
and extracting the correlation characteristics of the spatial context and the time sequence by combining the semantic segmentation result, and constructing a state tensor comprising position information, obstacle threat degree, image definition predicted value and energy consumption estimation.
Specifically, to generate a state tensor fusing environment semantics, firstly, flight state information (including position, speed, acceleration, attitude angle and the like), environment perception information (including obstacle distribution, dynamic target information and the like identified by a laser radar, ultrasonic waves or a visual sensor) and image definition information (including indexes such as edge definition, image contrast, texture detail retention and the like) of an unmanned aerial vehicle on-board flight control system and a sensor system are obtained.
This multi-source information is normalized to the [0, 1] interval to improve the stability and computational efficiency of the model input. Specifically, linear (min-max) normalization can be used for quantities such as position and velocity, while a standard score (z-score) or min-max normalization can be used for the image sharpness indices.
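The normalization step above can be sketched as follows; the sensor ranges used here are assumptions for illustration, not values from the patent:

```python
import numpy as np

def min_max(x, lo, hi):
    """Linear (min-max) normalization into [0, 1]; readings outside the
    expected range are clipped rather than extrapolated."""
    return np.clip((np.asarray(x, dtype=float) - lo) / (hi - lo), 0.0, 1.0)

def z_score(x):
    """Standard-score alternative, e.g. for the sharpness indices."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + 1e-9)

# Assumed sensor ranges, for illustration only.
state = np.concatenate([
    min_max([12.0, -3.0, 40.0], lo=-50.0, hi=50.0),  # position (m)
    min_max([4.2], lo=0.0, hi=15.0),                 # speed (m/s)
])
```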
And then, the processed flight state vector, the environment perception vector and the image definition vector are input into a space-time attention encoder together, the encoder adopts a multi-head attention mechanism to extract interaction features among different modal information respectively, automatically learns the dependency relationship and weight among the information, and improves the robustness and discriminant of state representation.
The encoder further combines the pixel-level barrier semantic tags output from the semantic segmentation module in the fusion process to perform joint modeling on the spatial context information and the time sequence features. And finally constructing a structured state tensor, wherein the tensor comprises the following fields of current position and track deviation, threat degree estimated values of obstacles in all directions, definition predicted values of current visual angle images and energy consumption rate estimated values in unit time, and the estimated values are used as the state input of a subsequent reinforcement learning strategy network.
The state tensor integrating the environment semantics is generated by inputting the flight state, the environment perception and the image definition information to the space-time attention encoder, so that deep association among the multi-mode information can be effectively mined, and the perception capability of state representation on dynamic environment change is improved. The method combines semantic tags and time sequence characteristics, and enhances the expression integrity and discriminant of the input state.
S300, introducing an information entropy weighting mechanism to the state tensor to perform state importance distribution to obtain a weighted state vector;
Further, the state importance allocation step comprises the following steps:
measuring the change degree of each sub-state component in the state tensor based on the information entropy theory;
Converting the information entropy value into a weight factor for representing the importance degree of each state dimension in policy decision;
And carrying out weighted fusion on the state tensors according to the weight factors to generate weighted state vectors with state importance perception capability.
Specifically, first, the state tensor output by the spatio-temporal attention encoder is obtained, denoted as $S = (s_1, s_2, \dots, s_n)$, wherein each sub-state component $s_i$ corresponds to a particular state feature such as the position deviation, obstacle threat degree, image sharpness prediction value, or energy consumption rate. Then, based on information entropy theory, the degree of variation of each sub-state component within a given time window is measured, and the entropy value is calculated by the following formula:

$$H_i = -\sum_{j=1}^{m} p_{ij} \log p_{ij}$$

where $p_{ij}$ denotes the probability that state component $s_i$ falls into the $j$-th value interval. Then, the entropy value $H_i$ of each sub-state is mapped to a weight factor $w_i$ through a normalization operation:

$$w_i = \frac{H_i}{\sum_{k=1}^{n} H_k}$$
The above weights represent the importance of each state component to the current policy decision: the higher the entropy value, the greater the fluctuation of that state dimension and the more significant its influence on policy stability.
Finally, each component of the original state tensor is weighted and fused according to the obtained weight factors to generate a weighted state vector $\tilde{S} = (w_1 s_1, w_2 s_2, \dots, w_n s_n)$, which serves as the state input of the subsequent reinforcement learning strategy model.
By introducing the entropy-based state importance weighting mechanism, the criticality of different state features to the current task can be adaptively identified, effectively highlighting the state information that significantly influences policy decisions. This mechanism improves the discriminability and stability of the state representation, so that the reinforcement learning strategy has higher perceptual acuity and adaptability when handling complex environmental changes, enhancing the accuracy and robustness of UAV attitude control.
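The weighting steps above can be sketched as follows. The histogram bin count and the time-window layout are illustrative assumptions; the patent does not fix how the per-component probability distribution is estimated:

```python
import math

def entropy(samples, bins=10):
    """Shannon entropy of one state component over a time window,
    estimated from a histogram of its recent values."""
    lo, hi = min(samples), max(samples)
    if hi == lo:                      # constant component: zero uncertainty
        return 0.0
    counts = [0] * bins
    for x in samples:
        j = min(int((x - lo) / (hi - lo) * bins), bins - 1)
        counts[j] += 1
    n = len(samples)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

def entropy_weights(window):
    """window[i] is the list of recent values of state component i.
    Returns weights w_i = H_i / sum_k H_k."""
    H = [entropy(s) for s in window]
    total = sum(H)
    if total == 0:                    # all components constant: uniform weights
        return [1.0 / len(H)] * len(H)
    return [h / total for h in H]

def weighted_state(state, weights):
    """Element-wise fusion producing the weighted state vector."""
    return [w * s for w, s in zip(weights, state)]

# A fluctuating component receives a larger weight than a near-constant one.
window = [[0.1, 0.9, 0.2, 0.8, 0.5, 0.3],   # volatile track deviation
          [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]]   # steady energy rate
w = entropy_weights(window)
print(w[0] > w[1])  # True
```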
S400, inputting the weighted state vector into a Meta-SAC model with meta-learning capability, and generating, through its Actor network, target flight reference pose parameters and gain coefficients for dynamically adjusting an LQR controller;
Specifically, a weighted state vector generated by the spatiotemporal attention encoder and processed by the entropy weighting mechanism is taken as an environmental input of the current moment. The state vector contains multi-modal fusion features such as position errors, obstacle threat degrees, image sharpness predictors, and energy consumption rates.
The weighted state vector is fed into a trained Meta-SAC model with meta-learning capability. The Actor network in Meta-SAC uses a parameter migration mechanism to quickly adapt to the current environment state, giving it few-shot generalization capability.
The Meta-SAC's Actor network outputs a continuous motion vector comprising two parts:
(1) Target flight reference pose parameters include target position coordinates, attitude angle (pitch, roll, yaw) and desired viewing angle direction of the onboard camera.
(2) An LQR gain adjustment quantity, used to adaptively adjust the feedback gain coefficients of the underlying LQR controller in the current task state, so as to achieve more responsive or more stable attitude tracking control.
After the action vector is parsed, the target reference pose is fed to the LQR controller as the control setpoint, and the gain adjustment is used to update the feedback matrix of the LQR, realizing dynamic cooperation between the high-level strategy and the low-level controller. Meanwhile, the tuple of current state vector, action output, reward value (to be calculated), and next state is stored for subsequent online or offline policy updates of the Meta-SAC model.
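The parsing of the Actor's continuous action vector into its two parts might look like the following sketch. The action layout (seven pose values followed by per-axis multiplicative gain scales) and the helper names are assumptions for illustration; the patent does not fix a concrete encoding:

```python
def parse_action(action):
    """Split the Actor's continuous action vector into the two parts
    described above. The 7 + N layout is a hypothetical convention."""
    # (1) target reference pose: x, y, z, pitch, roll, yaw, camera angle
    ref_pose = action[:7]
    # (2) multiplicative adjustments applied to the LQR feedback gains
    gain_scale = action[7:]
    return ref_pose, gain_scale

def adjust_gains(K_nominal, gain_scale):
    """Scale each row of the nominal feedback gain matrix, updating the
    LQR feedback law for the current task state."""
    return [[g * k for k in row] for g, row in zip(gain_scale, K_nominal)]

action = [10.0, 5.0, 30.0, 0.0, 0.0, 1.57, 0.2,   # pose part
          1.1, 0.9, 1.0]                           # gain part
pose, scale = parse_action(action)
K = adjust_gains([[2.0, 0.5], [1.0, 0.2], [0.8, 0.1]], scale)
print(K[0][0])  # 2.2 = 1.1 * 2.0
```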
S500, driving the LQR controller to generate low-level flight control instructions to control the unmanned aerial vehicle to execute the corresponding attitude actions, based on the target flight reference pose parameters and the LQR gain coefficients;
Further, the steps by which the LQR controller generates the low-level flight control instructions are as follows:
receiving the target flight reference pose parameters output by the Meta-SAC model as the desired input;
constructing a system state error vector by combining the current flight state of the unmanned aerial vehicle, and calculating the control instruction according to the dynamically adjusted feedback gain coefficients;
minimizing a cost function of state error and control energy consumption by solving a linear quadratic optimization problem, and outputting the low-level control quantities that drive the aircraft to execute the corresponding attitude adjustment.
Specifically, the LQR controller receives the target flight reference pose parameters output by the Meta-SAC model, including the desired position coordinates, attitude angles (pitch, roll, yaw), and camera viewing-angle adjustment instructions, as the desired input of the system. The current flight state information, including the actual position, velocity, attitude angles, and angular velocities, is acquired in real time through the onboard sensors to form the current state vector. The error between the desired pose and the current state is then calculated to form the state error vector of the system; for example, the position error is the difference between the desired position and the current actual position, and the attitude error is the difference between the desired attitude and the actual attitude. The feedback gain matrix K of the LQR controller is adjusted according to the dynamic feedback gain coefficients provided by the Meta-SAC model; this matrix determines the mapping weights from the state error vector to the control input, enabling adaptive adjustment. Taking the state error vector as the variable, a linear quadratic cost function is defined:
$$J = \int_{0}^{\infty} \left( e(t)^{\top} Q\, e(t) + u(t)^{\top} R\, u(t) \right) dt$$

where $e(t)$ denotes the state error vector, $u(t)$ denotes the control input vector, $Q$ denotes the state-error penalty weight matrix, a symmetric positive definite matrix used to quantify the penalty when the state deviates from the desired value, and $R$ denotes the control-energy penalty matrix, also symmetric positive definite, used to quantify the penalty on control effort. $Q$ is typically given by the designer, while $R$ is set to the identity matrix. The optimal feedback gain matrix $K$ is obtained by solving the algebraic Riccati equation, yielding the feedback control law $u(t) = -K e(t)$; that is, the control input is determined by the state error vector and the feedback gain. Finally, the control input $u(t)$ is resolved into throttle, control-surface, and propeller speed commands on each axis of the aircraft, producing the low-level flight control instructions.
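For a single attitude-error channel, the Riccati solution and the resulting feedback gain can be illustrated in scalar form. The real controller is multivariable, so this is only a minimal sketch of the same optimization under integrator-dynamics assumptions:

```python
import math

def lqr_scalar(a, b, q, r):
    """Solve the scalar continuous-time algebraic Riccati equation
        2*a*p - (b*p)**2 / r + q = 0
    for the system  de/dt = a*e + b*u  with cost  J = ∫ q*e² + r*u² dt,
    and return the optimal feedback gain k (control law u = -k*e)."""
    p = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
    return b * p / r

# One attitude-error channel modelled as a pure integrator (a = 0, b = 1).
q, r = 4.0, 1.0            # q chosen by the designer, r the identity (here 1)
k = lqr_scalar(0.0, 1.0, q, r)
print(k)                   # 2.0 = sqrt(q/r) for a pure integrator

# Dynamic gain adjustment from the high-level model: a larger state-error
# penalty yields a more aggressive gain, recomputed on the fly.
k_fast = lqr_scalar(0.0, 1.0, 9.0, r)
print(k_fast)              # 3.0
```

Scaling `q` up makes the controller track errors more aggressively at the cost of control energy, which is exactly the trade-off the dynamically adjusted gain coefficients exploit.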
By introducing the feedback gain coefficients dynamically output by the high-level Meta-SAC model, the controller adapts itself to different flight environments and task states. Building the linear quadratic optimization model on the state errors effectively balances attitude precision against energy efficiency, and the resulting control instructions combine fast response with high precision, significantly improving the attitude stability and flight safety of the unmanned aerial vehicle in complex environments.
S600, constructing a fuzzy logic reward function based on the energy consumption efficiency and patrol image sharpness during execution of the flight control instructions, and performing a strategy update with the reward function value as feedback to the Meta-SAC model.
Further, the steps of constructing the fuzzy logic reward function are as follows:
acquiring the image sharpness index, energy consumption rate, and relative distance to obstacles of the unmanned aerial vehicle during flight, and normalizing these continuous variables;
converting the normalized input variables into fuzzy variables and mapping them to preset fuzzy membership functions to obtain the corresponding fuzzy linguistic values;
performing fuzzy reasoning based on a preset fuzzy rule base, the reasoning rules comprising combined judgments of image sharpness, energy consumption, and obstacle distance;
converting the reasoning result into a specific numerical reward through a defuzzification method, to be used as the instant feedback signal for updating the strategy network of the Meta-SAC model.
Specifically, key operating indicators of the unmanned aerial vehicle during execution of the flight control instructions are collected, mainly comprising the image sharpness index (such as image contrast, edge sharpness, and detail retention), the energy consumption rate per unit time (such as the power-speed relationship and propulsion system power output), and the relative distance between the unmanned aerial vehicle and environmental obstacles (the Euclidean distance calculated from a lidar or depth camera). These continuous variables are normalized so that their numerical ranges fall within [0, 1], facilitating the subsequent fuzzy computation.
The normalized variables are each mapped to preset fuzzy membership functions: for example, the image sharpness may be divided into three fuzzy sets of low, medium, and high sharpness; the energy consumption rate into low, medium, and high consumption; and the obstacle distance into near, medium, and far. The fuzzy sets are defined by triangular, trapezoidal, or Gaussian membership functions.
A fuzzy rule base is then constructed, and a number of fuzzy reasoning rules are formulated according to task priorities and strategy objectives (for example: IF image sharpness is high AND energy consumption is low AND distance is far, THEN the reward is high; IF the image is blurred AND energy consumption is high AND distance is near, THEN the reward is low). Fuzzy logic reasoning is performed on the fuzzy linguistic values to compute a comprehensive fuzzy output.
Finally, a common defuzzification technique such as the centre-of-gravity method is used to convert the fuzzy reasoning result into a specific reward function value (for example, a continuous value in [0, 1]), which is fed into the Meta-SAC model as an instant feedback signal to drive the update of the strategy network, improving the convergence efficiency of the strategy and the practical effect of flight decisions.
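A minimal sketch of this reward construction, assuming triangular membership functions and a weighted-average (zero-order Sugeno) defuzzification in place of the centre-of-gravity method for brevity. The rule base below is illustrative, not the patent's:

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a to peak b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Membership functions over normalized [0, 1] inputs (shapes are assumptions).
low  = lambda x: tri(x, -0.5, 0.0, 0.5)
mid  = lambda x: tri(x, 0.0, 0.5, 1.0)
high = lambda x: tri(x, 0.5, 1.0, 1.5)

# Rule base: (sharpness set, energy set, distance set) -> crisp reward level.
RULES = [
    (high, low,  high, 1.0),   # sharp image, low energy, far obstacle: high
    (high, mid,  mid,  0.7),
    (mid,  mid,  mid,  0.5),
    (low,  high, low,  0.0),   # blurred, energy-hungry, close obstacle: low
]

def fuzzy_reward(sharpness, energy, distance):
    """Mamdani min-inference with weighted-average defuzzification."""
    num = den = 0.0
    for m_s, m_e, m_d, level in RULES:
        w = min(m_s(sharpness), m_e(energy), m_d(distance))  # firing strength
        num += w * level
        den += w
    return num / den if den > 0 else 0.5   # neutral reward if no rule fires

print(fuzzy_reward(0.9, 0.1, 0.9) > fuzzy_reward(0.2, 0.9, 0.1))  # True
```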
By fusing key indicators such as image sharpness, energy efficiency, and safety distance, and flexibly handling complex state information through a fuzzy reasoning mechanism, this design avoids the threshold sensitivity and limited expressiveness of conventional single-metric reward functions, and provides finer and more robust optimization guidance for the flight strategy, thereby improving the autonomous decision-making performance and execution stability of the unmanned aerial vehicle in complex dynamic environments.
Example two
The invention also provides an unmanned aerial vehicle flight attitude adjusting system based on reinforcement learning, which has a structure shown in fig. 2 and comprises:
The route planning module is used for constructing a dynamic environment grid according to the digital map and the real-time semantic segmentation result, and generating an initial route by a rapidly-exploring random tree (RRT) algorithm with spatio-temporal constraints;
The information fusion module is used for acquiring the flight state information of the unmanned aerial vehicle on the initial track, the environment perception information of the onboard sensors, and the image sharpness information, and inputting them into the spatio-temporal attention encoder to generate a state tensor fusing environment semantics;
The tensor weighting module is used for applying the information entropy weighting mechanism to the state tensor to assign state importance, obtaining a weighted state vector;
The pose generation module is used for inputting the weighted state vector into the Meta-SAC model with meta-learning capability, and generating, through the Actor network, the target flight reference pose parameters and the gain coefficients for dynamically adjusting the LQR controller;
The pose control module is used for driving the LQR controller to generate low-level flight control instructions to control the unmanned aerial vehicle to execute the corresponding attitude actions, based on the target flight reference pose parameters and the LQR gain coefficients;
And the feedback updating module is used for constructing a reward function based on the energy consumption efficiency and the patrol image sharpness during execution of the flight control instructions, and performing the strategy update with the reward function value as feedback to the Meta-SAC model.
In this embodiment, the unmanned aerial vehicle flight attitude adjustment system based on reinforcement learning is applied to a daily inspection task of a high-voltage transmission line, and the system is deployed on a multi-rotor unmanned aerial vehicle platform with an onboard high-definition camera, a laser radar and an inertial measurement unit. The specific implementation flow is as follows:
The route planning module calls a preloaded digital map of the transmission line, performs semantic segmentation on images acquired in real time by the onboard camera, and identifies scene elements such as towers, wires, and vegetation. The static geographic data are fused with dynamic obstacle information (such as birds and construction equipment) to construct a three-dimensional dynamic environment grid with second-level time resolution. Within this grid, taking the current UAV position as the start point and the patrol target point as the end point, an initial track is planned by an RRT algorithm incorporating spatio-temporal constraints and a Lyapunov stability criterion, and smoothed with a B-spline curve to generate a timed track sequence with attitude references.
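The basic expand-toward-sample loop of the planner can be sketched as a plain 2-D RRT on an occupancy grid. The spatio-temporal constraints, Lyapunov stability check, and B-spline smoothing described above are omitted, so this is only an illustrative skeleton:

```python
import math
import random

def edge_free(grid, p, q, n=5):
    """Sample n points along segment p->q and check them against the grid,
    so a step cannot jump through a thin obstacle."""
    for i in range(1, n + 1):
        t = i / n
        x = p[0] + (q[0] - p[0]) * t
        y = p[1] + (q[1] - p[1]) * t
        if grid[int(y)][int(x)] == 1:
            return False
    return True

def rrt(grid, start, goal, step=1.5, iters=2000, goal_tol=1.5, seed=0):
    """Minimal 2-D RRT on an occupancy grid (1 = obstacle)."""
    rng = random.Random(seed)
    h, w = len(grid), len(grid[0])
    nodes, parent = [start], {0: None}
    for _ in range(iters):
        sample = (rng.uniform(0, w - 1), rng.uniform(0, h - 1))
        # nearest existing tree node to the random sample
        i = min(range(len(nodes)), key=lambda k: math.dist(nodes[k], sample))
        near = nodes[i]
        d = math.dist(near, sample)
        if d == 0:
            continue
        # steer one fixed step from the nearest node toward the sample
        new = (near[0] + (sample[0] - near[0]) / d * step,
               near[1] + (sample[1] - near[1]) / d * step)
        if not (0 <= new[0] < w and 0 <= new[1] < h):
            continue
        if not edge_free(grid, near, new):
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = i
        if math.dist(new, goal) <= goal_tol:   # goal region reached
            path, k = [], len(nodes) - 1
            while k is not None:
                path.append(nodes[k])
                k = parent[k]
            return path[::-1]
    return None

grid = [[0] * 20 for _ in range(20)]
for r in range(5, 15):
    grid[r][10] = 1   # vertical wall with free passages above and below
path = rrt(grid, (2.0, 2.0), (17.0, 17.0))
print(path is not None)
```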
Then, during the flight of the unmanned aerial vehicle along the initial track, the information fusion module acquires in real time data such as the flight state (position, velocity, and attitude), the lidar environment point cloud, and the image sharpness index, and inputs them into the spatio-temporal attention encoder. The encoder fuses temporal dependencies with spatial distribution information and outputs a state tensor containing the position deviation, obstacle threat degree, predicted image sharpness, energy consumption per unit time, and the like.
Then, the tensor weighting module applies the information entropy weighting mechanism to each dimension of the state tensor, dynamically adjusts the state variable weights according to the current environment complexity and task priority, and outputs a more representative weighted state vector. This vector is fed into the Meta-SAC model of the pose generation module, whose Actor network outputs the target flight reference pose (position and attitude) and the corresponding LQR feedback gain adjustments.
Then, the pose control module feeds the reference pose parameters and the dynamic gain coefficients to the LQR controller. The controller computes the linear quadratic optimal solution based on the current state error, generates low-level flight control instructions such as accelerations and attitude angular rates, and drives the multi-rotor unmanned aerial vehicle to complete high-precision attitude tracking and track keeping in real time.
Finally, the feedback updating module constructs the fuzzy logic reward function from the image quality scores, energy consumption indicators, and flight stability data collected during the flight, and uses the computed result as instant feedback to the reinforcement learning strategy model. After the task ends, the system further optimizes the Meta-SAC strategy weight parameters through offline federated adversarial training, so as to improve the generalization performance and adaptability of the model in new environments.
The above formulas are dimensionless formulas for numerical calculation, fitted from a large amount of data collected in software simulation so as to reflect the latest real conditions; the preset parameters in the formulas are set by those skilled in the art according to the actual situation.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Finally, the foregoing description of the preferred embodiment of the invention is provided for the purpose of illustration only, and is not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510835183.7A CN120595851A (en) | 2025-06-20 | 2025-06-20 | UAV flight attitude adjustment method and system based on reinforcement learning |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN120595851A true CN120595851A (en) | 2025-09-05 |
Family
ID=96883463
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202510835183.7A Pending CN120595851A (en) | 2025-06-20 | 2025-06-20 | UAV flight attitude adjustment method and system based on reinforcement learning |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN120595851A (en) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120848590A (en) * | 2025-09-23 | 2025-10-28 | 浙江孚临科技有限公司 | A cross-modal attention feature fusion method for aerial robots |
| CN120874957A (en) * | 2025-09-26 | 2025-10-31 | 浙江孚临科技有限公司 | Layered reinforcement learning control method for aerial robot |
| CN121209566A (en) * | 2025-11-25 | 2025-12-26 | 天津天境飞航科技有限公司 | Unmanned plane path planning method and system for self-adaptive dynamic planning |
| CN120874957B (en) * | 2025-09-26 | 2026-02-06 | 浙江孚临科技有限公司 | Layered reinforcement learning control method for aerial robot |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||