Disclosure of Invention
The application provides a beam line station parameter optimization method based on extended Kalman filtering and reinforcement learning. By combining Kalman filtering with reinforcement learning, the method mitigates the influence of systematic errors during equipment parameter tuning and improves the accuracy of state estimation, thereby making strategy learning more accurate. The application provides the following technical scheme:
In a first aspect, the present application provides a method for optimizing parameters of a beam line station based on extended kalman filtering and reinforcement learning, the method comprising:
based on an initial strategy and a preset target state, randomly selecting a plurality of initial states from the environment, sampling, and collecting a plurality of pieces of track data each consisting of consecutive experience quadruples;
in the first round of sampling, training a probabilistic neural network by using the collected trajectory data to obtain a state transition model;
for each piece of track data, performing extended Kalman filtering in combination with the state transition model, replacing the next-moment state in the experience quadruples of each piece of track data with the filtered state, and storing the new experience quadruples in an experience playback pool;
and randomly sampling experience quadruples from the experience playback pool using the DDPG algorithm, learning and updating the current strategy to obtain a new strategy, and repeating the cycle until strategy learning is completed.
In a specific embodiment, training the probabilistic neural network using the collected trajectory data to obtain the state transition model includes:
After the first round of sampling is finished and a plurality of pieces of track data are collected, a preset probabilistic neural network is trained using the collected track data to obtain a state transition model, as follows:

s_{t+1} ~ N(μ_θ(s_t, a_t), σ_θ²(s_t, a_t));

wherein s_t and a_t represent the current state and the current action respectively, θ denotes the model parameters of the probabilistic neural network, μ_θ is the mean vector, and σ_θ² is the variance vector.
In a specific embodiment, performing extended Kalman filtering for each piece of track data in combination with the state transition model, replacing the next-moment state in the experience quadruples of each piece of track data with the filtered state, and saving the new experience quadruples in the experience playback pool comprises:
predicting the state at the current moment through a state transition model and calculating an error covariance matrix of the state;
Introducing an observation value and a Kalman gain, and correcting a predicted state value and an error covariance matrix;
after the correction is completed, updating the error covariance matrix, and substituting the corrected predicted state value as the next-moment state value into the original experience quadruple to form a new experience quadruple.
In a specific embodiment, predicting the state at the current moment by means of the state transition model and calculating the error covariance matrix of the state includes:
predicting the state at the current moment t using the state transition model, as follows:

x_t^- = μ_θ(s_{t-1}, a_{t-1});
the prediction formula of the error covariance at the current moment t is as follows:

P_t^- = F_t P_{t-1} F_t^T + Q;

wherein F_t is the Jacobian matrix of the state transition model with respect to the state at the current moment, P_{t-1} is the error covariance matrix of the update phase at the previous moment, P_t^- is the covariance matrix of the prediction phase at the current moment, Q is the process noise covariance matrix, and T represents the transpose operation.
In a specific embodiment, predicting the state at the current moment by means of the state transition model and calculating the error covariance matrix of the state further includes:
using N consecutive experience quadruples (s_{t-1,i}, a_{t-1,i}, r_i, s_{t,i}), i = 1, …, N, in the current trajectory, calculating the error of each experience quadruple with the state transition model, as follows:

μ_i = μ_θ(s_{t-1,i}, a_{t-1,i});

e_i = s_{t,i} - μ_i;

then calculating the average error of all experience quadruples, as follows:

ē = (1/N) Σ_{i=1}^{N} e_i;

and calculating Q using the following formula:

Q = (1/N) Σ_{i=1}^{N} (e_i - ē)(e_i - ē)^T;

wherein T indicates the transpose operation, i indexes the quadruple samples in the track data, and t-1 and t represent the last moment and the current moment in the experience quadruple respectively.
In a specific embodiment, introducing the observed value and the Kalman gain, and correcting the predicted state value and the error covariance matrix, includes:
calculating the difference y_t between the observed value z_t and the predicted value x_t^-, as follows:

y_t = z_t - x_t^-;
the Kalman gain K_t is calculated using the following formula:

K_t = P_t^- (P_t^- + R)^{-1};

wherein R is a diagonal observation-noise matrix calculated and constructed by dynamic estimation from continuous trajectory data in the experience playback pool: N consecutive experience samples are taken from the current trajectory, the next state is assumed to have d dimensional components, the standard deviation σ_j of each component is calculated, and R is computed as follows:

R = diag(σ_1², σ_2², …, σ_d²);
the correction formula of the predicted state value is as follows:

x_t = x_t^- + K_t y_t;

and the correction formula of the error covariance matrix is as follows:

P_t = (I - K_t) P_t^-;

wherein x_t is the corrected state estimate, P_t is the corrected error covariance matrix, and I is the identity matrix.
In a specific embodiment, predicting the state at the current moment by means of the state transition model and calculating the error covariance matrix of the state further includes:
the error covariance requires an initialization operation using trajectory data. For N consecutive experience quadruples in the current trajectory, the state s_{t,i} is assumed to have d dimensional components; the standard deviation σ_j of each of the d components is calculated over the N states, and the initial value of the error covariance is calculated as follows:

σ_j² = (1/N) Σ_{i=1}^{N} (s_{t,i,j} - s̄_j)²;

P_0 = diag(σ_1², σ_2², …, σ_d²);

wherein s̄_j represents the mean value of the j-th component and s_{t,i} represents the state in the i-th piece of track data. The covariance of the observed and predicted values is computed for each experience quadruple and averaged to serve as the initial value P_0 of the error covariance in the prediction stage.
In a second aspect, the application provides a beam line station parameter optimization system based on extended kalman filtering and reinforcement learning, which adopts the following technical scheme:
A beam line station parameter optimization system based on extended kalman filtering and reinforcement learning, comprising:
The track data acquisition module is used for randomly selecting a plurality of initial states from the environment and sampling the initial states based on an initial strategy and a preset target state, and collecting a plurality of track data consisting of continuous experience quadruples;
The state transition model generation module is used for training the probabilistic neural network by using the collected track data in the first round of sampling to obtain a state transition model;
The extended Kalman filtering module is used for carrying out extended Kalman filtering on each piece of track data by combining the state transition model, replacing the filtered next time state into an experience quadruple of each piece of track data and storing the new experience quadruple into an experience playback pool;
and the strategy learning module is used for randomly sampling the experience quadruple from the experience playback pool by using DDPG algorithm, learning and updating the current strategy to obtain a new strategy, and circulating until the strategy learning is completed.
In a third aspect, the application provides an electronic device, which comprises a processor and a memory, wherein a program is stored in the memory, and the program is loaded and executed by the processor to realize the beam line station parameter optimization method based on the extended Kalman filtering and reinforcement learning according to the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium having a program stored therein, which when executed by a processor is configured to implement a beam line station parameter optimization method based on extended kalman filtering and reinforcement learning as described in the first aspect.
In summary, the beneficial effects of the present application at least include:
1) In the process of optimizing beam line station parameters, equipment errors cause deviations between the estimated state and the actual equipment state. These errors degrade the learning of the reinforcement learning strategy, and in a sparse-reward environment in particular, their accumulation makes strategy optimization even harder. The application predicts and updates the state by introducing extended Kalman filtering (EKF), effectively reducing the influence of equipment noise on state estimation. When predicting the state, the extended Kalman filter corrects the previous estimate according to the current observation, making the state estimate more accurate. Through this process, errors are significantly mitigated, so reinforcement learning can still learn an accurate strategy under equipment noise and error interference, improving its effectiveness and stability in practical applications.
2) In reinforcement learning, an experience playback pool stores historical experience quadruples, typically for subsequent policy updates. In conventional methods, however, inaccurate state estimation means the data in the playback pool may contain noise, limiting its contribution to policy updates. By incorporating extended Kalman filtering, the application accurately estimates and corrects the state in each piece of track data, reducing the influence of noise. The revised state estimates enhance the validity of the data, making samples drawn from the experience playback pool more valuable for policy optimization. Accurate state estimation accelerates policy convergence, improves data utilization efficiency, shortens learning time, and improves overall optimization performance.
3) Reinforcement learning algorithms are expected to exhibit good adaptability and stability in dynamic, complex environments, yet unavoidable noise and uncertainty can destabilize them. Introducing extended Kalman filtering not only improves the accuracy of state estimation but also strengthens the algorithm's tolerance of equipment errors and external disturbances. In a complex physical environment, the algorithm reduces the impact of noise on the optimization process by continuously updating the predicted state and correcting its errors. It maintains stable performance in highly uncertain environments and improves robustness, so the method can operate efficiently across a wider range of application scenarios while ensuring a reliable optimization process.
The state transition model is established by training the probabilistic neural network, and the state is accurately estimated in combination with extended Kalman filtering, which reduces the influence of equipment errors on state estimation and improves the accuracy and stability of strategy learning. Finally, the current strategy is optimized using the DDPG algorithm, and a new strategy is obtained gradually through multiple iterations. The combination of Kalman filtering and reinforcement learning mitigates the influence of systematic errors in the equipment parameter tuning process and improves the accuracy of state estimation, making strategy learning more accurate.
The foregoing is only an overview of the technical scheme of the present application; for a better understanding, it may be implemented as described in the following preferred embodiments with reference to the accompanying drawings.
Detailed Description
The following describes in further detail the embodiments of the present application with reference to the drawings and examples. The following examples are illustrative of the application and are not intended to limit the scope of the application.
Optionally, the beam line station parameter optimization method based on extended Kalman filtering and reinforcement learning provided by each embodiment is described as executed on an electronic device; the electronic device is a terminal or a server, and the terminal may be a computer, a tablet computer, or the like. The embodiment does not limit the type of electronic device.
Referring to fig. 1, a flow chart of a beam-line station parameter optimization method based on extended kalman filtering and reinforcement learning according to an embodiment of the present application is shown, where the method at least includes the following steps:
step S101, based on an initial strategy and a preset target state, randomly selecting a plurality of initial states from the environment and sampling, and collecting a plurality of pieces of track data each consisting of consecutive experience quadruples.
Specifically, first, based on a preset target state s_goal, M initial states s_0^1, …, s_0^M are randomly selected. Based on the initial policy π_0, sampling is performed in the environment, and M trajectories are collected. Each trajectory is composed of a plurality of sequentially ordered experience quadruples (s_t, a_t, r_t, s_{t+1}), wherein t and t+1 represent the last moment and the current moment in the experience quadruple respectively, s_t represents the state at moment t, a_t represents the action at moment t, r_t represents the reward at moment t, and s_{t+1} represents the state at moment t+1.
The environment is the environment where the beam line station system is located.
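As a concrete illustration of this sampling step, the sketch below collects trajectories of experience quadruples (s_t, a_t, r_t, s_{t+1}) under a random initial policy. The environment dynamics, noise level, horizon, and reward shape here are illustrative assumptions standing in for the actual beam line station system.

```python
import numpy as np

# Hypothetical stand-in for the beam line station environment: a 2-D state
# nudged by the action, with additive Gaussian device noise.
def step(state, action, rng, noise=0.01):
    next_state = state + action + rng.normal(0.0, noise, size=state.shape)
    reward = -float(np.linalg.norm(next_state))  # illustrative reward signal
    return next_state, reward

def collect_trajectory(policy, s0, horizon, rng):
    """Collect one trajectory of experience quadruples (s_t, a_t, r_t, s_{t+1})."""
    traj, s = [], s0
    for _ in range(horizon):
        a = policy(s)
        s_next, r = step(s, a, rng)
        traj.append((s, a, r, s_next))
        s = s_next
    return traj

rng = np.random.default_rng(0)
random_policy = lambda s: rng.uniform(-0.1, 0.1, size=s.shape)  # exploratory initial policy
trajectories = [collect_trajectory(random_policy, rng.normal(size=2), 20, rng)
                for _ in range(5)]
```

Note how consecutive quadruples chain: the next-state of one quadruple is the state of the following one, which is the structure the later filtering step relies on.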
And step S102, training the probabilistic neural network by using the collected trajectory data in the first round of sampling to obtain a state transition model.
In implementation, after the first round of sampling is finished and a plurality of pieces of track data have been collected, a preset probabilistic neural network is trained using the collected track data to obtain a state transition model, as follows:

s_{t+1} ~ N(μ_θ(s_t, a_t), σ_θ²(s_t, a_t));

wherein θ denotes the model parameters of the probabilistic neural network, and s_t and a_t represent the current state and the current action respectively. The state transition model has two outputs: the mean vector μ_θ, which is also the expected state vector at the next moment, and the variance vector σ_θ², which represents the randomness present in the transition. The input of the probabilistic neural network is the current state s_t and the action a_t. In practice, the goal of the probabilistic neural network is to learn the probability distribution relating the current state s_t and action a_t to the next state s_{t+1}: the model assumes that s_{t+1} obeys a Gaussian distribution in each dimension and outputs the mean vector μ_θ and the variance vector σ_θ². In a Gaussian distribution, the mean vector represents the center of the distribution and is its most likely value; therefore, after model training is complete, the mean vector output by the state transition model is used as the predicted value of the next state of the system. The variance vector represents uncertainty and reflects the credibility of the prediction; for example, a small variance indicates that the model has high confidence in the predicted value.
It should be noted that the probabilistic neural network is an existing model designed in advance for the specific reinforcement learning task; its objective is to predict the next state and its uncertainty from the current state and action. Training is completed using the trajectory data collected during the first round of sampling, comprising the state of the device, the action taken, and the resulting state. During training, the trajectory data are divided into inputs and target outputs following standard neural network practice, and the network parameters are adjusted by optimizing a loss function so that the model accurately predicts the state transition relation. This ensures that the probabilistic neural network can efficiently represent the complex relationships between device states and actions, providing reliable state predictions for the subsequent steps. Because training the probabilistic neural network is costly, the state transition model is not retrained in every round of strategy updating. Meanwhile, the initial strategy is randomly generated and has strong exploration capability, so each state and action can be traversed sufficiently; although subsequent strategies may be more convergent, the state transition model obtained from the first round is sufficient to describe the entire optimization process without repeated updates.
In summary, step S102 successfully builds a state transition model of the system by training the probabilistic neural network. The mean vector as a predictor for the next state simplifies subsequent calculations while maintaining a description of state transition uncertainty. After the model parameters are fixed, the following steps can more efficiently utilize the model to carry out strategy optimization, so that the overall calculation cost is reduced.
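The interface of such a probabilistic model, mean and variance outputs fitted to observed transitions, can be sketched without a neural network. The snippet below fits a linear-Gaussian stand-in in closed form (an illustrative assumption; the application itself uses a probabilistic neural network) on synthetic transitions s' = s + a + noise, and evaluates the Gaussian negative log-likelihood that a probabilistic network would minimize.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic transitions standing in for collected trajectory data.
S = rng.normal(size=(200, 2))
A = rng.uniform(-0.1, 0.1, size=(200, 2))
S_next = S + A + rng.normal(0.0, 0.05, size=S.shape)

# Fit the mean mu(s, a) by least squares and sigma^2 from residuals -- a
# closed-form linear-Gaussian stand-in for the probabilistic network f_theta.
X = np.hstack([S, A, np.ones((len(S), 1))])      # inputs: state, action, bias
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)   # mean parameters
mu = X @ W                                       # fitted mean for each sample
sigma2 = np.var(S_next - mu, axis=0) + 1e-8      # per-dimension variance vector

def predict(s, a):
    """Return (mean, variance) of the next state, mirroring the model's outputs."""
    x = np.concatenate([s, a, [1.0]])
    return x @ W, sigma2

# Gaussian negative log-likelihood -- the loss a probabilistic network minimizes.
nll = 0.5 * np.mean((S_next - mu) ** 2 / sigma2 + np.log(sigma2))
```

As in the text, only the mean is later used as the next-state prediction, while the variance conveys the credibility of that prediction.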
And step S103, for each piece of track data, carrying out extended Kalman filtering by combining a state transition model, replacing the filtered next moment state into an experience quadruple of each piece of track data, and storing the new experience quadruple into an experience playback pool.
In step S103, the state at the current time is predicted by the state transition model, and the error covariance matrix of the state is calculated, where the error covariance matrix quantifies the uncertainty of the predicted state. An observation value and the Kalman gain are then introduced to correct the predicted state value and the error covariance matrix, yielding a more accurate predicted state value. After the correction is completed, the error covariance matrix is updated, and the corrected predicted state value is substituted as the next-moment state value into the original experience quadruple to form a new experience quadruple, ensuring more accurate state estimation for each piece of track data and improving the efficiency and quality of subsequent training.
Specifically, the state transition model obtained by training in step S102 is used to predict the state at the current moment t, as follows:

x_t^- = μ_θ(s_{t-1}, a_{t-1});

at the current moment t, the state transition model outputs both a mean and a variance, but only the mean is used as the predicted value; the variance is not applied to the prediction.
In practice, the error covariance requires an initialization operation. Specifically, for N consecutive experience quadruples in the current trajectory, the state s_{t,i} is assumed to have d dimensional components; the standard deviation σ_j of each of the d components is calculated over the N states, and the initial value of the error covariance is calculated as follows:

σ_j² = (1/N) Σ_{i=1}^{N} (s_{t,i,j} - s̄_j)²;

P_0 = diag(σ_1², σ_2², …, σ_d²);

wherein s̄_j represents the mean value of the j-th component and s_{t,i} represents the state in the i-th piece of track data. The covariance of the observed and predicted values is computed for each experience quadruple and averaged to serve as the initial value P_0 of the error covariance in the prediction stage.
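A minimal sketch of this initialization, assuming the diagonal construction described above (component-wise standard deviations over N consecutive next-states):

```python
import numpy as np

def init_error_covariance(next_states):
    """Build the initial error covariance P_0 = diag(sigma_j^2) from the
    per-component standard deviations of N consecutive states."""
    next_states = np.asarray(next_states)      # shape (N, d)
    sigma = next_states.std(axis=0, ddof=0)    # std of each of the d components
    return np.diag(sigma ** 2)

# Illustrative data: 50 two-dimensional states with different per-axis spread.
rng = np.random.default_rng(2)
P0 = init_error_covariance(rng.normal(0.0, [0.1, 0.3], size=(50, 2)))
```

The resulting matrix is diagonal, with larger initial uncertainty assigned to the noisier component.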
In practice, the prediction formula for the error covariance is as follows:

P_t^- = F_t P_{t-1} F_t^T + Q;

wherein F_t is the Jacobian matrix of the state transition model with respect to the state at the current moment; it linearizes the nonlinear process model and propagates the state estimation error of the last moment to the current moment, reflecting the influence of the last state on the current state. P_{t-1} is the error covariance matrix of the update phase at the previous moment; P_t^- is the covariance matrix of the prediction phase at the current moment, representing the uncertainty of the state after prediction; Q is the process noise covariance matrix, describing the additional uncertainty introduced in the prediction process; and T represents the transpose operation.
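The covariance prediction can be sketched as follows. Since only the trained mean function would be available in practice, the Jacobian F_t is approximated here by central finite differences; the transition mean f used below is an illustrative nonlinear function, not the trained model.

```python
import numpy as np

def numerical_jacobian(f, s, a, eps=1e-6):
    """Central finite-difference Jacobian of the transition mean w.r.t. the state."""
    d = len(s)
    J = np.zeros((d, d))
    for j in range(d):
        ds = np.zeros(d)
        ds[j] = eps
        J[:, j] = (f(s + ds, a) - f(s - ds, a)) / (2 * eps)
    return J

# Illustrative nonlinear transition mean (an assumption, not the trained model).
f = lambda s, a: s + 0.1 * np.tanh(s) + a

s, a = np.array([0.5, -0.2]), np.array([0.05, 0.0])
F = numerical_jacobian(f, s, a)

# EKF covariance prediction: P_t^- = F P F^T + Q.
P = np.diag([0.04, 0.04])       # previous update-phase covariance (illustrative)
Q = np.diag([0.001, 0.001])     # process noise covariance (illustrative)
P_pred = F @ P @ F.T + Q
```

For this f the Jacobian is diagonal, 1 + 0.1·(1 − tanh²(s_j)), which the finite-difference approximation recovers to high accuracy.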
Optionally, the application uses the experience playback pool and an empirical statistical estimation method to calculate Q. Specifically, using N consecutive experience quadruples (s_{t-1,i}, a_{t-1,i}, r_i, s_{t,i}), i = 1, …, N, in the current trajectory, the error of each experience quadruple is calculated with the state transition model, as follows:

μ_i = μ_θ(s_{t-1,i}, a_{t-1,i});

e_i = s_{t,i} - μ_i;

then the average error of all experience quadruples is calculated, as follows:

ē = (1/N) Σ_{i=1}^{N} e_i;

finally, Q is calculated using the following formula:

Q = (1/N) Σ_{i=1}^{N} (e_i - ē)(e_i - ē)^T;

wherein T indicates the transpose operation, i indexes the quadruple samples in the track data, and t-1 and t represent the last moment and the current moment in the experience quadruple respectively.
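A sketch of this empirical estimation of Q from one-step prediction residuals; the stand-in dynamics and the assumed mean predictor below are illustrative, not the trained model.

```python
import numpy as np

def estimate_process_noise(states, actions, next_states, predict_mean):
    """Estimate Q over N consecutive samples:
    e_i = s'_i - mu(s_i, a_i),  Q = (1/N) sum_i (e_i - e_bar)(e_i - e_bar)^T."""
    errors = np.array([s_next - predict_mean(s, a)
                       for s, a, s_next in zip(states, actions, next_states)])
    e_bar = errors.mean(axis=0)        # average error over all quadruples
    centered = errors - e_bar
    return centered.T @ centered / len(errors)

# Illustrative data: dynamics s' = s + a plus Gaussian noise of std 0.05.
rng = np.random.default_rng(3)
S = rng.normal(size=(100, 2))
A = rng.uniform(-0.1, 0.1, size=(100, 2))
S_next = S + A + rng.normal(0.0, 0.05, size=(100, 2))

Q = estimate_process_noise(S, A, S_next, predict_mean=lambda s, a: s + a)
```

With a well-matched mean predictor, the diagonal of the estimated Q approaches the true noise variance (0.05² = 0.0025 per dimension here).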
In implementation, after the state and covariance matrix are obtained by prediction, an observation value and the Kalman gain are introduced to correct the predicted state value and error covariance, yielding a more accurate current state estimate and covariance matrix. Specifically, the difference y_t between the observed value z_t and the predicted value x_t^- is first calculated, as follows:

y_t = z_t - x_t^-;

the Kalman gain K_t is then calculated using the following formula:

K_t = P_t^- (P_t^- + R)^{-1};

wherein R is a diagonal observation-noise matrix calculated and constructed by dynamic estimation from continuous trajectory data in the experience playback pool, in a manner similar to the initial value of the error covariance: N consecutive experience samples are taken from the current trajectory, the next state is assumed to have d dimensional components, the standard deviation σ_j of each component is calculated, and R is computed as follows:

R = diag(σ_1², σ_2², …, σ_d²);
In practice, the correction formula for the predicted state value is as follows:

x_t = x_t^- + K_t y_t;

and the correction formula for the error covariance matrix is as follows:

P_t = (I - K_t) P_t^-;

wherein x_t is the corrected state estimate, P_t is the corrected error covariance matrix, and I is the identity matrix.
After the correction is completed, the corrected predicted state value and error covariance matrix of each piece of track data are updated, and the corrected state is substituted into the corresponding experience quadruple as the next-moment state to form a new experience quadruple, which is placed into the experience playback pool.
In step S103, by the extended kalman filter process, the state estimation of the trajectory data is more accurate and the quality is higher, and the training efficiency and effect of reinforcement learning are both significantly improved. Meanwhile, the robustness and the adaptability of the system are improved through a dynamic estimation and correction mechanism, so that the algorithm can better cope with complex physical environment and randomness challenges.
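The innovation, gain, and correction formulas of this step condense into one function. The sketch assumes, as the text does, that the observed quantity is the state itself (observation model H = I); the numeric inputs are illustrative.

```python
import numpy as np

def ekf_correct(x_pred, P_pred, z, R):
    """EKF correction step with identity observation model:
    innovation, Kalman gain, corrected state, corrected covariance."""
    y = z - x_pred                                  # innovation: observation minus prediction
    K = P_pred @ np.linalg.inv(P_pred + R)          # Kalman gain, H = I
    x_corr = x_pred + K @ y                         # corrected state estimate
    P_corr = (np.eye(len(x_pred)) - K) @ P_pred     # corrected error covariance
    return x_corr, P_corr

x_pred = np.array([1.0, 2.0])       # predicted state (illustrative)
P_pred = np.diag([0.09, 0.09])      # predicted error covariance
R = np.diag([0.01, 0.01])           # observation noise: observations trusted more
z = np.array([1.2, 1.9])            # observed state
x_corr, P_corr = ekf_correct(x_pred, P_pred, z, R)
```

Because R is small relative to P_pred, the gain is close to 1 and the corrected state moves most of the way toward the observation, while the covariance shrinks accordingly.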
Step S104, randomly sampling experience quadruples from the experience playback pool using the DDPG algorithm, learning and updating the current strategy to obtain a new strategy, and cycling in turn until strategy learning is completed.
In implementation, a deep deterministic policy gradient (DDPG) algorithm is employed to optimize the current strategy. DDPG combines the policy gradient method with deep reinforcement learning and achieves efficient learning in continuous action spaces. Specifically, several experience quadruples are randomly sampled from the experience playback pool, gradients of the policy network are calculated using the sampled data, and the policy network parameters are updated by back propagation; the value network parameters are updated according to the target value. Through multiple iterations, the current strategy is gradually optimized into a better strategy, which serves as the new strategy.
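A full DDPG implementation is beyond a short sketch, but two of the mechanics this step relies on, uniform sampling from the experience playback pool and the Polyak (soft) target-network update, can be shown compactly. The pool contents, batch size, and τ value below are illustrative assumptions.

```python
import numpy as np

def sample_batch(replay_pool, batch_size, rng):
    """Uniformly sample experience quadruples from the replay pool."""
    idx = rng.integers(0, len(replay_pool), size=batch_size)
    return [replay_pool[i] for i in idx]

def soft_update(target_params, source_params, tau=0.005):
    """Polyak-averaged target update used by DDPG:
    theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [(1 - tau) * t + tau * s for t, s in zip(target_params, source_params)]

rng = np.random.default_rng(4)
# Replay pool of filtered quadruples (s, a, r, s') -- contents are illustrative.
pool = [(rng.normal(size=2), rng.normal(size=2), float(rng.normal()), rng.normal(size=2))
        for _ in range(100)]
batch = sample_batch(pool, 32, rng)

# Target parameters track the source parameters geometrically.
target = [np.zeros(3)]
source = [np.ones(3)]
for _ in range(10):
    target = soft_update(target, source, tau=0.1)
```

After n soft updates toward a fixed source, the target parameter equals 1 − (1 − τ)^n, illustrating the slow, stabilizing drift of DDPG's target networks.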
In summary, with reference to fig. 2, the application provides a beam line station parameter optimization method based on extended Kalman filtering and reinforcement learning, which aims to optimize beam line station parameters and to address the difficulty that reinforcement learning struggles to learn an effective strategy in sparse-reward scenarios when equipment errors are present. The state transition model is established by training the probabilistic neural network, and the state is accurately estimated in combination with extended Kalman filtering, reducing the influence of equipment errors on state estimation and improving the accuracy and stability of strategy learning. Finally, the current strategy is optimized using the DDPG algorithm, and a new strategy is obtained gradually through multiple iterations; the combination of Kalman filtering and reinforcement learning mitigates the influence of systematic errors during equipment parameter tuning and improves the accuracy of state estimation, making strategy learning more accurate.
FIG. 3 is a block diagram of an extended Kalman filtering and reinforcement learning based beam line station parameter optimization system according to one embodiment of the present application, the system at least includes the following modules:
The track data acquisition module is used for randomly selecting a plurality of initial states from the environment and sampling the initial states based on an initial strategy and a preset target state, and collecting a plurality of track data consisting of continuous experience quadruples;
The state transition model generation module is used for training the probabilistic neural network by using the collected track data in the first round of sampling to obtain a state transition model;
The extended Kalman filtering module is used for carrying out extended Kalman filtering on each piece of track data by combining a state transition model, replacing the filtered next time state into an experience quadruple of each piece of track data and storing the new experience quadruple into an experience playback pool;
And the strategy learning module is used for randomly sampling the experience quadruple from the experience playback pool by using DDPG algorithm, learning and updating the current strategy to obtain a new strategy, and circulating until the strategy learning is completed.
For relevant details reference is made to the method embodiments described above.
Fig. 4 is a block diagram of an electronic device provided in one embodiment of the application. The device comprises at least a processor 401 and a memory 402.
Processor 401 may include one or more processing cores, such as a 4-core processor, an 8-core processor, etc. The processor 401 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). The processor 401 may also include a main processor and a coprocessor; the main processor is a processor for processing data in a wake-up state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 401 may integrate a GPU (Graphics Processing Unit) for rendering and drawing content to be displayed on the display screen. In some embodiments, the processor 401 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 402 may include one or more computer-readable storage media, which may be non-transitory. Memory 402 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 402 is used to store at least one instruction for execution by processor 401 to implement the extended kalman filter and reinforcement learning based beam-line station parameter optimization method provided by the method embodiments of the present application.
In some embodiments, the electronic device may also optionally include a peripheral interface and at least one peripheral. The processor 401, memory 402, and peripheral interfaces may be connected by buses or signal lines. The individual peripheral devices may be connected to the peripheral device interface via buses, signal lines or circuit boards. Illustratively, the peripheral devices include, but are not limited to, radio frequency circuitry, touch display screens, audio circuitry, and power supplies, among others.
Of course, the electronic device may also include fewer or more components, as the present embodiment is not limited in this regard.
Optionally, the present application further provides a computer readable storage medium, in which a program is stored, and the program is loaded and executed by a processor to implement the beam line station parameter optimization method based on the extended kalman filtering and reinforcement learning in the above method embodiment.
Optionally, the present application further provides a computer product, where the computer product includes a computer readable storage medium, where a program is stored, and the program is loaded and executed by a processor to implement the beam line station parameter optimization method based on extended kalman filtering and reinforcement learning in the above method embodiment.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination that contains no contradiction should be considered within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application; their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the application. It should be noted that those skilled in the art may make several variations and modifications without departing from the concept of the application, all of which fall within its scope of protection. Accordingly, the scope of protection of the present application shall be determined by the appended claims.