Disclosure of Invention
The invention aims to solve the technical problem of providing a ship task self-adaptive planning method and device based on deep reinforcement learning, which combine historical experience data and personnel data to train and optimize a decision model through a deep reinforcement learning algorithm. The method not only realizes efficient task planning, but also provides personalized action suggestions according to the individual differences of personnel, thereby improving the execution efficiency and success rate of the overall task.
In order to solve the technical problems, a first aspect of the embodiments of the present invention discloses a method for adaptive planning of a ship task based on deep reinforcement learning, the method comprising:
S1, acquiring ship task planning data information, wherein the ship task planning data information comprises historical data information and personnel planning data information;
S2, preprocessing the ship task planning data information to obtain preprocessed data information;
S3, training a preset deep reinforcement learning planning model by utilizing the preprocessed data information to obtain an optimized deep reinforcement learning planning model;
And S4, processing the ship task data information to be processed by using the optimized deep reinforcement learning planning model to obtain a ship task planning scheme.
In a first aspect of the embodiment of the present invention, the preprocessing the ship mission planning data information to obtain preprocessed data information includes:
S21, carrying out data cleaning on the ship task planning data information to obtain cleaning data information;
S22, carrying out data conversion on the cleaning data information to obtain standard data information;
S23, extracting the characteristics of the standard data information to obtain the preprocessed data information.
In a first aspect of the embodiment of the present invention, the feature extracting the standard data information to obtain the preprocessed data information includes:
S231, extracting features of the standard data information to obtain first feature information;
S232, carrying out feature extraction on the standard data information to obtain second feature information;
S233, performing mapping interpolation processing on the first characteristic information to obtain first mapping characteristic information;
S234, performing mapping interpolation processing on the second characteristic information to obtain second mapping characteristic information;
S235, the first mapping characteristic information and the second mapping characteristic information are processed to obtain preprocessed data information.
In a first aspect of the embodiment of the present invention, the feature extracting the standard data information to obtain first feature information includes:
Performing feature extraction on the standard data information by using a preset first feature extraction model to obtain first feature information;
the first feature extraction model expression is:
where C_s(t, f) is the first feature information, f is the frequency, t is the time, φ(ζ, τ) denotes the kernel function, s(t) is the input standard data information, τ is the time shift, ζ is the frequency shift, u is the integration variable, exp is the exponential function, σ is a constant, and * denotes the complex conjugate.
In a first aspect of the embodiment of the present invention, the performing a mapping interpolation process on the first feature information to obtain first mapping feature information includes:
Carrying out mapping interpolation processing on the first characteristic information by using a preset mapping interpolation model to obtain first mapping characteristic information;
the preset mapping interpolation model expression is as follows:
M′ij = (Mij − Mmin) / (Mmax − Mmin)
where M is the first feature information, M′ is the first mapping feature information, M′ij is the element of M′ in row i and column j, Mmin is the minimum value in M, Mmax is the maximum value in M, 1 ≤ i ≤ m, 1 ≤ j ≤ n, and m and n are the numbers of rows and columns of the matrix M.
In a first aspect of the embodiment of the present invention, the processing the first mapping feature information and the second mapping feature information to obtain preprocessed data information includes:
S2351, processing the first mapping feature information to obtain an auto-covariance matrix Σ11;
S2352, processing the second mapping feature information to obtain an auto-covariance matrix Σ22;
S2353, processing the first mapping feature information and the second mapping feature information to obtain a first cross-covariance matrix Σ12 and a second cross-covariance matrix Σ21;
S2354, solving a preset optimization objective function to obtain transformation matrices W x and W y;
The preset optimization objective function is as follows:
where Q1 = Σ11 and Q2 = Σ22; β, λ and θ are coefficient constants; and the superscript T denotes the transpose;
S2355, the transformation matrices W x and W y are processed to obtain preprocessed data information.
In a first aspect of the embodiment of the present invention, the processing the ship task data information to be processed to obtain a ship task planning scheme by using the optimized deep reinforcement learning planning model includes:
S41, processing ship task data information to be processed by using the optimized deep reinforcement learning planning model to obtain decision information;
and S42, carrying out data processing on the decision information and the personnel planning data information to obtain a ship task planning scheme.
The second aspect of the embodiment of the invention discloses a ship task self-adaptive planning device based on deep reinforcement learning, which comprises the following components:
The information acquisition module is used for acquiring ship task planning data information, wherein the ship task planning data information comprises historical data information and personnel planning data information;
The preprocessing module is used for preprocessing the ship task planning data information to obtain preprocessed data information;
The model training module is used for training a preset deep reinforcement learning planning model by utilizing the preprocessed data information to obtain an optimized deep reinforcement learning planning model;
And the ship task planning module is used for processing the ship task data information to be processed by utilizing the optimized deep reinforcement learning planning model to obtain a ship task planning scheme.
In a second aspect of the embodiment of the present invention, the preprocessing the ship mission planning data information to obtain preprocessed data information includes:
S21, carrying out data cleaning on the ship task planning data information to obtain cleaning data information;
S22, carrying out data conversion on the cleaning data information to obtain standard data information;
S23, extracting the characteristics of the standard data information to obtain the preprocessed data information.
In a second aspect of the embodiment of the present invention, the feature extracting the standard data information to obtain the preprocessed data information includes:
S231, extracting features of the standard data information to obtain first feature information;
S232, carrying out feature extraction on the standard data information to obtain second feature information;
S233, performing mapping interpolation processing on the first characteristic information to obtain first mapping characteristic information;
S234, performing mapping interpolation processing on the second characteristic information to obtain second mapping characteristic information;
S235, the first mapping characteristic information and the second mapping characteristic information are processed to obtain preprocessed data information.
In a second aspect of the embodiment of the present invention, the feature extracting the standard data information to obtain first feature information includes:
Performing feature extraction on the standard data information by using a preset first feature extraction model to obtain first feature information;
the first feature extraction model expression is:
where C_s(t, f) is the first feature information, f is the frequency, t is the time, φ(ζ, τ) denotes the kernel function, s(t) is the input standard data information, τ is the time shift, ζ is the frequency shift, u is the integration variable, exp is the exponential function, σ is a constant, and * denotes the complex conjugate.
In a second aspect of the embodiment of the present invention, the performing a mapping interpolation process on the first feature information to obtain first mapping feature information includes:
Carrying out mapping interpolation processing on the first characteristic information by using a preset mapping interpolation model to obtain first mapping characteristic information;
the preset mapping interpolation model expression is as follows:
M′ij = (Mij − Mmin) / (Mmax − Mmin)
where M is the first feature information, M′ is the first mapping feature information, M′ij is the element of M′ in row i and column j, Mmin is the minimum value in M, Mmax is the maximum value in M, 1 ≤ i ≤ m, 1 ≤ j ≤ n, and m and n are the numbers of rows and columns of the matrix M.
In a second aspect of the embodiment of the present invention, the processing the first mapping feature information and the second mapping feature information to obtain preprocessed data information includes:
S2351, processing the first mapping feature information to obtain an auto-covariance matrix Σ11;
S2352, processing the second mapping feature information to obtain an auto-covariance matrix Σ22;
S2353, processing the first mapping feature information and the second mapping feature information to obtain a first cross-covariance matrix Σ12 and a second cross-covariance matrix Σ21;
S2354, solving a preset optimization objective function to obtain transformation matrices W x and W y;
The preset optimization objective function is as follows:
where Q1 = Σ11 and Q2 = Σ22; β, λ and θ are coefficient constants; and the superscript T denotes the transpose;
S2355, the transformation matrices W x and W y are processed to obtain preprocessed data information.
In a second aspect of the embodiment of the present invention, the processing the ship task data information to be processed to obtain a ship task planning scheme by using the optimized deep reinforcement learning planning model includes:
S41, processing ship task data information to be processed by using the optimized deep reinforcement learning planning model to obtain decision information;
and S42, carrying out data processing on the decision information and the personnel planning data information to obtain a ship task planning scheme.
The third aspect of the invention discloses another adaptive planning device for ship tasks based on deep reinforcement learning, which comprises:
a memory storing executable program code;
A processor coupled to the memory;
The processor invokes the executable program codes stored in the memory to execute part or all of the steps in the adaptive planning method for the ship task based on deep reinforcement learning disclosed in the first aspect of the embodiment of the invention.
A fourth aspect of the present invention discloses a computer-readable storage medium storing computer instructions which, when called, are used for performing part or all of the steps in the deep reinforcement learning-based ship task adaptive planning method disclosed in the first aspect of the present invention.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
(1) In the prior art, ship mission planning systems often have difficulty processing high-dimensional, nonlinear and time-series data, and therefore cannot make full use of rich historical experience data to optimize decisions. By introducing a deep reinforcement learning algorithm, the invention can process such complex data more effectively and extract valuable features and patterns from it, thereby providing more accurate and comprehensive data support for task planning.
(2) Enhanced real-time adaptability: current systems generally lack the ability to adjust strategies in real time when facing rapidly changing environments and task conditions, which causes decision delay and inaccuracy. Through a real-time feedback mechanism and online learning capability, the invention enables the system to dynamically adjust its strategy according to the real-time situation during task execution, ensuring the timeliness and accuracy of decision making.
(3) Personalized decision support: the prior art rarely considers the individual differences and skill levels of ship personnel and cannot provide targeted decision support. By combining personnel data, including individual differences in skills and experience, the invention can generate personalized action suggestions for each person, so that the potential of the personnel is brought into full play and the overall combat effectiveness is improved.
(4) Simplified model updating: traditional models often require retraining or extensive adjustment when faced with new data or new conditions, which is time-consuming and laborious and may affect model stability. The model design of the invention takes the requirements of continuous learning and incremental updating into account, so that the model can be updated quickly and effectively when new data arrives, without retraining the whole model.
Detailed Description
In order to make the present invention better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms first, second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, article, or device that comprises a list of steps or elements is not limited to the list of steps or elements but may, in the alternative, include other steps or elements not expressly listed or inherent to such process, method, article, or device.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The invention discloses a ship task self-adaptive planning method and device based on deep reinforcement learning. The method comprises: obtaining ship task planning data information, wherein the ship task planning data information comprises historical data information and personnel planning data information; preprocessing the ship task planning data information to obtain preprocessed data information; training a preset deep reinforcement learning planning model by utilizing the preprocessed data information to obtain an optimized deep reinforcement learning planning model; and processing ship task data information to be processed by using the optimized deep reinforcement learning planning model to obtain a ship task planning scheme. The following describes the invention in detail.
Example 1
Referring to fig. 1, fig. 1 is a schematic flow chart of a ship task adaptive planning method based on deep reinforcement learning according to an embodiment of the present invention. The adaptive planning method for the ship task based on the deep reinforcement learning described in fig. 1 is applied to the technical field of ship task planning, and the embodiment of the invention is not limited. As shown in fig. 1, the adaptive planning method for the ship task based on the deep reinforcement learning may include the following operations:
S1, acquiring ship task planning data information, wherein the ship task planning data information comprises historical data information and personnel planning data information;
S2, preprocessing the ship task planning data information to obtain preprocessed data information;
S3, training a preset deep reinforcement learning planning model by utilizing the preprocessed data information to obtain an optimized deep reinforcement learning planning model;
And S4, processing the ship task data information to be processed by using the optimized deep reinforcement learning planning model to obtain a ship task planning scheme.
Optionally, the preprocessing the ship mission planning data information to obtain preprocessed data information includes:
S21, carrying out data cleaning on the ship task planning data information to obtain cleaning data information;
removing abnormal values in the ship mission planning data information and performing interpolation processing on missing values to obtain the cleaning data information;
an abnormal value is a data item whose deviation from the mean of the data exceeds a preset threshold;
a missing value is information that is absent from the ship mission planning data information and is identified by comparison with empirical (historical experience) information;
S22, carrying out data conversion on the cleaning data information to obtain standard data information;
the cleaning data information is normalized by dividing it by its maximum value, so that the resulting standard data information lies in [0, 1].
S23, extracting the characteristics of the standard data information to obtain the preprocessed data information.
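As an illustrative sketch of steps S21 and S22 (not the exact procedure of the invention), the following Python fragment removes abnormal values whose deviation from the column mean exceeds a preset threshold (expressed here as a z-score), interpolates missing values from neighbouring records, and normalizes each column by dividing by its maximum value; the z-score threshold and the assumption of non-negative numeric columns are made for illustration only.

import numpy as np
import pandas as pd

def clean_and_standardize(df: pd.DataFrame, z_thresh: float = 3.0) -> pd.DataFrame:
    # Sketch of S21 (data cleaning) and S22 (data conversion) for numeric planning data.
    cleaned = df.copy()
    for col in cleaned.select_dtypes(include=[np.number]).columns:
        # S21: treat values far from the column mean as abnormal, mark them as missing,
        # then fill missing values by interpolating between neighbouring records.
        z = (cleaned[col] - cleaned[col].mean()) / (cleaned[col].std() + 1e-12)
        cleaned.loc[z.abs() > z_thresh, col] = np.nan
        cleaned[col] = cleaned[col].interpolate(limit_direction="both")
        # S22: divide by the column maximum so the standard data lie in [0, 1]
        # (assumes non-negative values, as described above).
        cleaned[col] = cleaned[col] / (cleaned[col].max() + 1e-12)
    return cleaned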
Optionally, the feature extracting the standard data information to obtain preprocessed data information includes:
S231, extracting features of the standard data information to obtain first feature information;
S232, carrying out feature extraction on the standard data information to obtain second feature information;
S233, performing mapping interpolation processing on the first characteristic information to obtain first mapping characteristic information;
S234, performing mapping interpolation processing on the second characteristic information to obtain second mapping characteristic information;
S235, the first mapping characteristic information and the second mapping characteristic information are processed to obtain preprocessed data information.
Optionally, the feature extracting the standard data information to obtain first feature information includes:
Performing feature extraction on the standard data information by using a preset first feature extraction model to obtain first feature information;
the first feature extraction model expression is:
where C_s(t, f) is the first feature information, f is the frequency, t is the time, φ(ζ, τ) denotes the kernel function, s(t) is the input standard data information, τ is the time shift, ζ is the frequency shift, u is the integration variable, exp is the exponential function, σ is a constant, and * denotes the complex conjugate.
Optionally, the performing a mapping interpolation process on the first feature information to obtain first mapping feature information includes:
Carrying out mapping interpolation processing on the first characteristic information by using a preset mapping interpolation model to obtain first mapping characteristic information;
the preset mapping interpolation model expression is as follows:
M′ij = (Mij − Mmin) / (Mmax − Mmin)
where M is the first feature information, M′ is the first mapping feature information, M′ij is the element of M′ in row i and column j, Mmin is the minimum value in M, Mmax is the maximum value in M, 1 ≤ i ≤ m, 1 ≤ j ≤ n, and m and n are the numbers of rows and columns of the matrix M.
Optionally, the processing the first mapping feature information and the second mapping feature information to obtain preprocessed data information includes:
S2351, processing the first mapping feature information to obtain an auto-covariance matrix Σ11;
S2352, processing the second mapping feature information to obtain an auto-covariance matrix Σ22;
S2353, processing the first mapping feature information and the second mapping feature information to obtain a first cross-covariance matrix Σ12 and a second cross-covariance matrix Σ21;
S2354, solving a preset optimization objective function to obtain transformation matrices W x and W y;
The preset optimization objective function is as follows:
where Q1 = Σ11 and Q2 = Σ22; β, λ and θ are coefficient constants; and the superscript T denotes the transpose;
S2355, the transformation matrices W x and W y are processed to obtain preprocessed data information.
The preprocessed data information Z is obtained from the first mapping feature information X1 and the second mapping feature information X2 by means of the transformation matrices Wx and Wy.
Extracting features of the standard data information to obtain second feature information, wherein the feature extraction comprises the following steps:
The standard data information x(t) is first normalized as x′(t) = (x(t) − x_min)/(x_max − x_min), where x_max is the maximum value of the data information, x_min is the minimum value, and x′(t) is the normalized data information. Empirical Mode Decomposition (EMD) is then performed on x′(t) to decompose it into Intrinsic Mode Functions (IMFs) of different time scales, each being a simple and relatively stationary component:
(1) all maximum and minimum points of x′(t) are determined, and polynomial interpolation is used to obtain the corresponding upper envelope e_max(t) and lower envelope e_min(t);
(2) the mean envelope m(t) = (e_min(t) + e_max(t))/2 is obtained from the upper and lower envelopes;
(3) the detail h(t) = x′(t) − m(t) is extracted from x′(t) and the mean m(t);
(4) if h(t) satisfies the IMF conditions, the first modal component IMF1 (denoted c_1(t)) is obtained; otherwise x′(t) is replaced by h(t) and the decomposition continues;
(5) finally, the standard data information is decomposed as x′(t) = Σ_i c_i(t) + r(t), where r(t) is the residue.
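A minimal numerical sketch of the sifting procedure described above is given below; it uses cubic-spline envelopes as the polynomial interpolation, a fixed number of sifting iterations in place of a formal IMF stopping criterion, and the min-max normalization shown above, all of which are illustrative choices rather than the exact procedure of the invention.

import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_once(x, t):
    # One sifting iteration: subtract the mean of the upper and lower envelopes.
    maxima = argrelextrema(x, np.greater)[0]
    minima = argrelextrema(x, np.less)[0]
    if len(maxima) < 4 or len(minima) < 4:
        return None                                  # too few extrema: x is treated as the residue
    e_max = CubicSpline(t[maxima], x[maxima])(t)     # upper envelope e_max(t)
    e_min = CubicSpline(t[minima], x[minima])(t)     # lower envelope e_min(t)
    m = (e_min + e_max) / 2.0                        # mean envelope m(t)
    return x - m                                     # detail h(t) = x'(t) - m(t)

def emd(x, t, max_imfs=5, sift_iters=10):
    # Decompose a signal into IMFs c_i(t) plus a residue r(t), as described above.
    x = (x - x.min()) / (x.max() - x.min())          # min-max normalization to x'(t)
    residue, imfs = x.copy(), []
    for _ in range(max_imfs):
        h = residue.copy()
        for _ in range(sift_iters):                  # repeat sifting until h approximates an IMF
            h_new = sift_once(h, t)
            if h_new is None:
                return imfs, residue
            h = h_new
        imfs.append(h)
        residue = residue - h                        # remove the extracted IMF and continue
    return imfs, residue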
Optionally, the processing the ship task data information to be processed by using the optimized deep reinforcement learning planning model to obtain a ship task planning scheme includes:
S41, processing ship task data information to be processed by using the optimized deep reinforcement learning planning model to obtain decision information;
and S42, carrying out data processing on the decision information and the personnel planning data information to obtain a ship task planning scheme.
The deep reinforcement learning planning model adopts multi-reward reinforcement learning (MRRL), which is a generalization of standard single-reward reinforcement learning. The model therefore also takes a Markov decision process as its framework, defined by the tuple (S, A, P, R, γ), where S is the state space, A is the action space, P is the state transition probability, R is the reward function, and γ ∈ (0, 1) is the discount factor.
The policy π is:
π(a|s) = P[A_t = a | S_t = s]
that is, the probability of selecting action a when the state is s, which constitutes the formulation of a policy.
MRRL differs from single-reward reinforcement learning in that its reward function returns not a scalar value but a vector of m reward values:
R(s, a, s′) = (R_1(s, a, s′), …, R_m(s, a, s′))
In this case the policy is likewise evaluated by a vector of expected returns.
The goal of multi-reward reinforcement learning is to find, or approach, the optimal decision for the whole system. When the Q-learning method is used to solve the multi-reward reinforcement learning problem, the Q value of each reward component can be learned in parallel, and the Q values can likewise be stored in vector form:
Q(s, a) = (Q_1(s, a), …, Q_m(s, a))
The most common way to derive a policy from these estimates is to compute a linear scalarization, i.e. a weighted sum of the Q vector with the weight vector w:
SQ(s, a) = Σ_i w_i Q_i(s, a)
The weight vector w represents the weight of each objective: for multiple rewards, the reward that should be satisfied preferentially is given a larger weight. However, trading off multiple rewards by setting these weights a priori is difficult and often unscientific, and the setting of the weights usually requires a large number of parameters to be tuned.
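As a minimal tabular sketch of multi-reward Q-learning with linear scalarization (the deep network used by the invention is replaced here by a Q table, and the epsilon-greedy exploration scheme and hyperparameter values are illustrative assumptions):

import numpy as np

class MultiRewardQAgent:
    # Tabular Q-learning with an m-dimensional reward vector and a weight vector w.
    def __init__(self, n_states, n_actions, n_rewards, weights, alpha=0.1, gamma=0.95, eps=0.1):
        self.Q = np.zeros((n_states, n_actions, n_rewards))  # one Q component per reward
        self.w = np.asarray(weights, dtype=float)             # weight vector w over the m rewards
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def scalarized_q(self, s):
        return self.Q[s] @ self.w                              # SQ(s, a) = sum_i w_i Q_i(s, a)

    def act(self, s, rng):
        if rng.random() < self.eps:                            # epsilon-greedy exploration
            return int(rng.integers(self.Q.shape[1]))
        return int(np.argmax(self.scalarized_q(s)))

    def update(self, s, a, reward_vec, s_next):
        # Each reward component is learned in parallel; the greedy next action is chosen
        # with respect to the scalarized Q values of the next state.
        a_next = int(np.argmax(self.scalarized_q(s_next)))
        target = np.asarray(reward_vec, float) + self.gamma * self.Q[s_next, a_next]
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])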
Example 2
Referring to fig. 2, fig. 2 is a flow chart of another adaptive planning method for a ship task based on deep reinforcement learning according to an embodiment of the present invention. The adaptive planning method for the ship task based on the deep reinforcement learning described in fig. 2 is applied to the technical field of ship task planning, and the embodiment of the invention is not limited. As shown in fig. 2, the adaptive planning method for the ship task based on the deep reinforcement learning may include the following operations:
The invention aims to provide a ship task self-adaptive planning system based on deep reinforcement learning, which overcomes the defects of the prior art, namely limited data processing capability, lack of real-time adaptability, inability to provide personalized decisions, and insufficient multi-agent cooperation capability.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
The system structure comprises a data preprocessing module, a deep reinforcement learning module, a personalized decision support module and a multi-person intelligent cooperation feedback module. All modules are connected with each other through data flow and control flow, so that information transmission and processing are realized. And the data preprocessing module is responsible for extracting relevant data from the historical experience database and the personnel database, and cleaning, converting and extracting features so as to adapt to the requirements of a deep reinforcement learning algorithm.
The ship tasks comprise a navigation task, a cargo loading and unloading task, a ship maintenance task and the like, the task types are determined according to the task requests and are transmitted to corresponding processing modules, and key characteristics including ship states, environment parameters, task related characteristics and the like are extracted.
And the deep reinforcement learning module adopts a complex deep reinforcement learning algorithm, such as PPO or A2C, and combines a simulator to carry out model training. The module is capable of handling high-dimensional state space and action space, autonomous learning and improvement through interaction with the environment.
And the personalized decision support module is used for generating personalized action suggestions for each person by combining personnel data such as skills, experiences and the like. The module displays suggestions through a user interaction interface and collects feedback of personnel so as to realize personalized decision support.
And the multi-agent cooperation module coordinates respective actions through the multi-agent system in the case that a plurality of ships or departments need to cooperate. The module ensures that the information sharing and communication are smooth so as to realize the efficient completion of the whole task.
And data preprocessing, namely extracting relevant data from a historical experience database and a personnel database, wherein the relevant data comprise task execution records, environment information, personnel skills and the like. And cleaning, converting and extracting features of the data to obtain a data set suitable for the deep reinforcement learning algorithm.
The data preprocessing is responsible for extracting relevant data from a historical experience database and a personnel database, and cleaning, converting and extracting features so as to adapt to the requirements of a deep reinforcement learning algorithm.
Historical experience databases include, but are not limited to, travel logs, incident reports, historical task performance, and the like. These data cover the operational experience and performance of the vessel under different environmental conditions.
Data cleaning, namely removing repeated records, outliers and incomplete data.
And data conversion, namely uniformly converting the data in different formats and units into a format suitable for processing by a deep reinforcement learning algorithm, such as converting time series data into window data with fixed length.
The feature extraction method comprises the following steps:
(1) Ship track data are compressed using Laplacian Eigenmaps and Gaussian kernel density estimation (G-KDE) to extract key steering points.
(2) The steering points are clustered using a fuzzy adaptive DBSCAN method to identify common steering areas.
(3) Ship features and task-related features are extracted as the input of the deep reinforcement learning model.
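The following sketch illustrates steps (1) and (2) using standard library stand-ins: SpectralEmbedding for the Laplacian eigenmaps, scipy's gaussian_kde for G-KDE, and ordinary DBSCAN in place of the fuzzy adaptive DBSCAN of the text; the rule of keeping the lowest-density embedded points as candidate key steering points is an assumption made only for illustration.

import numpy as np
from scipy.stats import gaussian_kde
from sklearn.manifold import SpectralEmbedding   # Laplacian-eigenmaps-style embedding
from sklearn.cluster import DBSCAN

def extract_turning_regions(track_xy, n_key_points=50, eps=0.05, min_samples=5):
    # Sketch of steps (1)-(2): compress a track and cluster candidate key steering points.
    track_xy = np.asarray(track_xy, dtype=float)
    # (1) Low-dimensional embedding of the track points, then Gaussian kernel density
    # estimation over the embedded points; the lowest-density points are kept as
    # candidate key steering points (sharp manoeuvres are comparatively rare).
    emb = SpectralEmbedding(n_components=2, affinity="nearest_neighbors").fit_transform(track_xy)
    density = gaussian_kde(emb.T)(emb.T)
    key_points = track_xy[np.argsort(density)[:n_key_points]]
    # (2) Standard DBSCAN as a stand-in for the fuzzy adaptive DBSCAN of the text.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(key_points)
    return key_points, labels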
Deep reinforcement learning, namely performing model training by adopting complex deep reinforcement learning algorithms such as PPO or A2C and the like. The historical scene is reproduced by constructing a simulator and expanding the dataset with data enhancement techniques. The model is constantly interacted with the environment in the training process, and the strategy is adjusted according to feedback so as to realize autonomous learning and improvement.
Complex deep reinforcement learning algorithms, such as PPO or A2C, are used in conjunction with simulators to perform model training. The module is capable of handling high-dimensional state space and action space, autonomous learning and improvement through interaction with the environment.
The simulator can simulate the sailing conditions of the ship under different sea conditions, meteorological conditions, mission planning and the like, and provides a rich training environment for the deep reinforcement learning model.
The model training flow is as follows:
the algorithm is selected by adopting a near-end strategy optimization (PPO) or a deep reinforcement learning algorithm such as a dominant actor commentator (A2C).
Simulator construction, namely constructing a simulator based on historical experience data and personnel data.
Training data, namely taking data in a historical experience database as training samples, wherein the training samples comprise ship states, tasks, personnel and corresponding action decisions and result feedback.
And the training process is to input the preprocessed data into a deep reinforcement learning model, and perform multi-round iterative training through a simulator, so as to continuously adjust model parameters to optimize a decision strategy.
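A minimal training sketch using the Stable-Baselines3 implementation of PPO is shown below; the environment id "ShipMissionEnv-v0" is hypothetical and stands for a simulator built from the historical experience and personnel data, and all hyperparameter values are illustrative rather than prescribed by the invention.

import gymnasium as gym
from stable_baselines3 import PPO

# Hypothetical simulator environment: observations encode ship state, environment
# parameters and task-related features; the reward reflects task-execution feedback.
env = gym.make("ShipMissionEnv-v0")

model = PPO(
    "MlpPolicy",            # multilayer-perceptron policy over the extracted feature vector
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    gamma=0.99,
    verbose=1,
)
model.learn(total_timesteps=200_000)     # multi-round iterative training in the simulator
model.save("ship_mission_ppo")

# At deployment time the optimized model maps a preprocessed state to a decision:
obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)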
Personalized decision support, namely generating personalized action suggestions for each person by combining personnel data. And displaying the advice through a user interaction interface, and collecting feedback of personnel. The system adjusts and optimizes the advice according to the feedback to provide decision support that better meets the actual needs.
Personalized decision support combines personnel data, such as skills, experience, etc., to generate personalized action suggestions for each person. The module displays suggestions through a user interaction interface and collects feedback of personnel so as to realize personalized decision support.
The specific steps of personalized decision support are as follows:
(1) Personnel data analysis, which collects and analyzes personnel data including skill levels, historical operating records, preference settings, and the like.
(2) Decision advice generation, in which the output of the deep reinforcement learning model and personnel data are combined to generate personalized action advice for each personnel.
(3) User interaction and feedback, namely displaying personalized suggestions through a user interaction interface, and collecting real-time feedback of personnel for further optimizing a decision model.
(4) And updating the model, namely dynamically adjusting decision model parameters according to personnel feedback to realize continuous optimization of personalized decisions.
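A simple sketch of how the model output might be combined with personnel data and feedback is given below; the multiplicative skill weighting, the additive feedback bias and the update rule are illustrative assumptions rather than the specific combination used by the invention.

def personalized_suggestions(action_scores, person_skill, feedback_bias, top_k=3):
    # action_scores: candidate action -> score from the planning model
    # person_skill:  candidate action -> skill/experience factor in [0, 1] for this person
    # feedback_bias: candidate action -> adjustment accumulated from this person's feedback
    ranked = sorted(
        action_scores,
        key=lambda a: action_scores[a] * person_skill.get(a, 0.5) + feedback_bias.get(a, 0.0),
        reverse=True,
    )
    return ranked[:top_k]

def update_feedback(feedback_bias, action, accepted, lr=0.1):
    # Step (4): nudge the bias for an action up or down according to user feedback,
    # so suggestions a person repeatedly rejects gradually drop in the ranking.
    feedback_bias[action] = feedback_bias.get(action, 0.0) + (lr if accepted else -lr)
    return feedback_bias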
Intelligent collaboration by multiple personnel, where multiple vessels or departments are required to collaborate, the respective actions are coordinated by a multi-agent system. The system ensures that the information sharing and communication are smooth so that each ship or department can know the state and intention of each other in real time, thereby realizing the efficient completion of the whole task.
In the case where multiple vessels or departments are required to cooperate, the respective actions are coordinated by a multi-agent system. The module ensures that the information sharing and communication are smooth so as to realize the efficient completion of the whole task.
The specific implementation mode is as follows:
The agent definition is that each ship or key department is mapped into an agent, and each agent has independent deep reinforcement learning model and decision capability.
And the information sharing and communication are carried out, namely a multi-agent communication system is constructed, and the real-time sharing of state information, task progress and decision results among agents is ensured.
Collaborative decision making: key data useful for collaborative decisions are extracted using an attention-based processing method, and interaction experience is accumulated using a memory-driven experience learning method. A multi-head attention mechanism and a noisy network are introduced to enhance the exploration capability and robustness of the agents' decisions.
And the conflict resolution and policy optimization are that when decision conflict occurs, the conflict is resolved through a negotiation mechanism or a preset rule, so that the high-efficiency completion of the whole task is ensured. Meanwhile, according to feedback in the task execution process, the decision strategy of each agent is dynamically adjusted, and the cooperative optimization closed-loop control is realized.
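The sketch below shows one plausible realization of the attention step in PyTorch: each agent attends over the shared state messages of all agents through multi-head attention, and Gaussian noise injection stands in for a full noisy-network layer; the dimensions and the noise model are assumptions made for illustration.

import torch
import torch.nn as nn

class AgentCommunication(nn.Module):
    # Each agent attends over the shared status/intent messages of all cooperating agents.
    def __init__(self, state_dim=32, n_heads=4, noise_std=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=state_dim, num_heads=n_heads, batch_first=True)
        self.out = nn.Linear(state_dim, state_dim)
        self.noise_std = noise_std

    def forward(self, agent_states):
        # agent_states: (batch, n_agents, state_dim) shared state messages
        context, _ = self.attn(agent_states, agent_states, agent_states)
        out = self.out(context)
        if self.training:                 # simple noise injection as a stand-in for a noisy network
            out = out + self.noise_std * torch.randn_like(out)
        return out                        # per-agent context consumed by each agent's policy head

For example, AgentCommunication()(torch.randn(1, 3, 32)) produces one context vector per agent for three cooperating ships, which each agent's policy can then use for collaborative decision making.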
Fig. 3 shows the deep reinforcement learning model training process, which comprises the steps of data input, feature extraction, model training and policy output. Through continuous iteration and optimization, the model can gradually adapt to environmental changes and propose better decision strategies.
Environment initialization: the environment provides the agent with initial state data, which is the basis for understanding the current environment.
Feature extraction: after the state data of the environment are received, the raw data are converted into feature vectors that the model can understand and process.
Model training, namely training a deep reinforcement learning model based on the extracted features. By adjusting the parameters of the model, the ability of predicting the optimal action according to the current state is continuously improved.
A dynamic programming equation is adopted in the implementation of the deep reinforcement learning; the formula is:
v_π(s) = Σ_{a∈A} π(a|s) q_π(s, a)
In this equation, π(a|s) is a probability distribution representing the probability of selecting action a in state s, and q(s, a) is the action-value function, which estimates the expected return obtained by taking action a in state s.
(1) Action output: after training, the model outputs an action policy according to the features of the current state.
(2) Environment feedback: after the action is executed, the environment gives a corresponding reward signal and transitions to a new state, which provides the basis for the next decision.
(3) Model optimization, namely optimizing the model according to the reward signal and the new state fed back by the environment.
Through continuous iteration, the model gradually adapts to the environment, and the accuracy of decision making is improved.
(4) Iterative loop: the new state is taken as the input of the optimized model, and the above steps are repeated to form a closed-loop iterative process. This process continues until the model reaches convergence or a preset stop condition is met.
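A self-contained sketch of this closed loop is shown below; the CartPole environment, the identity feature extractor and the random action choice are placeholders standing in for the ship-mission simulator, the feature extraction step and the trained planning model, respectively.

import gymnasium as gym
import numpy as np

env = gym.make("CartPole-v1")                               # placeholder environment
extract_features = lambda obs: np.asarray(obs, np.float32)  # feature extraction step (identity here)

obs, _ = env.reset()                                        # environment initialization: initial state
for step in range(1_000):
    features = extract_features(obs)
    action = env.action_space.sample()                      # (1) action output (placeholder policy)
    obs, reward, terminated, truncated, _ = env.step(action)   # (2) reward signal and new state
    # (3) a real agent would update its parameters here from (features, action, reward, obs)
    if terminated or truncated:                             # (4) closed-loop iteration until stop
        obs, _ = env.reset()
env.close()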
Example 3
Referring to fig. 4, fig. 4 is a schematic structural diagram of a ship task adaptive planning device based on deep reinforcement learning according to an embodiment of the present invention. The adaptive planning device for the ship task based on deep reinforcement learning described in fig. 4 is applied to the technical field of ship task planning, and the embodiment of the invention is not limited thereto. As shown in fig. 4, the adaptive planning device for the ship task based on deep reinforcement learning may include the following modules:
S301, an information acquisition module is used for acquiring ship task planning data information, wherein the ship task planning data information comprises historical data information and personnel planning data information;
S302, a preprocessing module is used for preprocessing the ship task planning data information to obtain preprocessed data information;
S303, a model training module is used for training a preset deep reinforcement learning planning model by utilizing the preprocessed data information to obtain an optimized deep reinforcement learning planning model;
And S304, a ship task planning module is used for processing ship task data information to be processed by using the optimized deep reinforcement learning planning model to obtain a ship task planning scheme.
Example 4
Referring to fig. 5, fig. 5 is a schematic structural diagram of a ship task adaptive planning device based on deep reinforcement learning according to an embodiment of the present invention. The adaptive planning device for the ship task based on the deep reinforcement learning described in fig. 5 is applied to the technical field of ship task planning, and the embodiment of the invention is not limited. As shown in fig. 5, the adaptive planning apparatus for a ship mission based on deep reinforcement learning may include the following operations:
a memory 401 storing executable program codes;
A processor 402 coupled with the memory 401;
The processor 402 invokes the executable program code stored in the memory 401 to perform the steps in the deep reinforcement learning-based ship task adaptive planning method described in Example 1 and Example 2.
Example 5
The embodiment of the invention discloses a computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to perform the steps in the deep reinforcement learning-based ship task adaptive planning method described in Example 1 and Example 2.
The apparatus embodiments described above are merely illustrative, in which the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical, i.e., may be located in one place, or may be distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above detailed description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on such understanding, the foregoing technical solutions may be embodied essentially or in part in the form of a software product that may be stored in a computer-readable storage medium, including a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, magnetic disc memory, tape memory, or any other medium that can be used to carry or store computer-readable data.
Finally, it should be noted that the disclosed adaptive planning method and device for ship tasks based on deep reinforcement learning are only preferred embodiments of the present invention and are only used to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, without departing in essence from the spirit and scope of the technical solutions of the embodiments of the present invention.