CN119002288B - Method, device and medium for improving control accuracy of offline reinforcement learning robot - Google Patents
- Publication number: CN119002288B
- Application number: CN202411475045.4A
- Authority: CN (China)
- Prior art keywords: data, damaged, distribution, function, offline
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
Abstract
The invention discloses a method, a device, and a medium for improving the control accuracy of an offline reinforcement learning robot, belonging to the field of robot control. The method comprises the following steps: step 1, acquiring an offline data set containing damaged data and undamaged data; step 2, training an offline reinforcement learning model for controlling the robot on the offline data set by a robust variational Bayesian inference method until the cumulative reward is maximized; and step 3, deploying the offline reinforcement learning model trained in step 2 on the robot to control the robot to complete a preset operation task. The method uses a Bayesian inference framework to capture the uncertainty caused by diverse data damage in the offline data set, thereby reducing the negative effect of the damaged data on the policy, significantly improving the robustness and performance of the model in a clean environment, and improving the accuracy of the robot controlled by the offline reinforcement learning model.
Description
Technical Field
The invention relates to the technical field of robot control, in particular to a method for improving the control accuracy of an offline reinforcement learning robot.
Background
Offline reinforcement learning algorithms aim to learn effective policies from a fixed, pre-collected data set, avoiding real-time interaction with complex environments. This learning paradigm has significant advantages in scenarios where data collection is costly, risky, or difficult to perform in real time, such as healthcare, autonomous driving, and industrial automation. However, because of limitations in how the data set is collected, offline reinforcement learning algorithms often face the core challenge of distribution shift: there is a distribution shift between the behavior policy in the offline data set and the policy learned during robot training, which typically leads to overestimation of out-of-distribution (OOD) actions and thereby degrades the learned policy.
To address this challenge, existing solutions often introduce uncertainty estimation techniques, such as ensemble methods over the action value function or Bayesian inference, to measure the uncertainty of the action value function or of the dynamics model. These techniques constrain the learned policy to stay, to some degree, within the range of the behavior policy of the offline data set, thereby improving the robustness of the policy.
However, in practical applications, offline data sets are often damaged to varying degrees by sensor failures, malicious attacks, and the like. The damage may take the form of random noise, adversarial attacks, or other data perturbations, affecting key elements of the data set such as states, actions, rewards, and transition dynamics. Classical offline reinforcement learning algorithms tend to assume that the data set is clean and intact, so when faced with damaged data the learned policies tend to fit the damaged data, resulting in a significant degradation of performance after deployment in a clean environment.
Although researchers have made some progress in robust offline reinforcement learning, for example approaches that attempt to mitigate noise or counter attacks by enhancing robustness at test time, most of this work focuses on non-offline (online) settings and on defending against attacks during testing, and lacks effective methods for handling damage that is already present in the training data set. In addition, current robust offline reinforcement learning methods for damaged data usually address only specific types of corruption, such as damage to states or to transition dynamics, and cannot effectively cope with the complex situation in which multiple elements of the data set are damaged at the same time.
In the field of robot control, the existing offline reinforcement learning algorithm often faces the problem of data damage due to sensor faults or malicious attacks and the like, and the damaged data usually introduce high uncertainty, so that the performance of the existing algorithm is greatly reduced during actual deployment, and the accuracy of robot control is further affected.
In view of this, the present invention has been made.
Disclosure of Invention
The invention aims to provide a method, equipment and medium for improving the control accuracy of an offline reinforcement learning robot, which can improve the control accuracy of the robot by using an offline reinforcement learning algorithm, thereby solving the technical problems in the prior art.
The invention aims at realizing the following technical scheme:
a method of improving control accuracy of an offline reinforcement learning robot, comprising:
Step 1, acquiring an offline data set containing damaged data;
Step 2, mapping the offline data set into an action value function, estimating uncertainty values of posterior distribution of the action value function by a variational Bayesian inference method, and adjusting weights used by damaged data based on the uncertainty values to train an offline reinforcement learning model of the control robot until the accumulated value of rewards is maximized;
Step 3, deploying the offline reinforcement learning model trained in step 2 on a robot, and controlling the robot to complete a preset operation task.
A processing apparatus, comprising:
At least one memory for storing one or more programs;
At least one processor capable of executing one or more programs stored in the memory, which when executed by the processor, enable the processor to implement the methods of the present invention.
A readable storage medium storing a computer program which, when executed by a processor, is capable of carrying out the method according to the invention.
Compared with the prior art, the method, the device and the medium for improving the control accuracy of the offline reinforcement learning robot have the beneficial effects that:
The offline reinforcement learning model for controlling the robot is trained on the offline data set by a robust variational Bayesian inference method until the cumulative reward is maximized. A Bayesian inference framework captures the uncertainty caused by diverse data damage in the offline data set: all damaged data are modeled as uncertainty of the action value function, the posterior distribution of the action value function is approximated from the offline data, and damaged and undamaged data are distinguished by an entropy-based uncertainty measure, so that the influence of the damaged data is adjusted during policy learning. This reduces the negative effect of the damaged data on the policy, significantly improves the robustness and performance of the offline reinforcement learning model in a clean environment, and improves the accuracy of the robot controlled by the model. The method overcomes the limitation of existing offline reinforcement learning algorithms in the complex situation where multiple elements of the data set are damaged at the same time, thereby enhancing the effectiveness of offline reinforcement learning in practical applications.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for improving control accuracy of an offline reinforcement learning robot according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a probability model of a decision process of a method according to an embodiment of the present invention.
FIG. 3 is a graph of performance scores of the method of the present invention tested under random mixed corruption in the Lane-v0 task of the CARLA simulator.
Fig. 4 is a schematic diagram of a frame of a method for improving control accuracy of an offline reinforcement learning robot according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the specific contents of the present invention, and it is obvious that the described embodiments are only some embodiments of the present invention, but not all embodiments, which do not limit the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The terms that may be used herein will first be described as follows:
the term "and/or" is intended to mean that either or both may be implemented, e.g., X and/or Y are intended to include both the cases of "X" or "Y" and the cases of "X and Y".
The terms "comprises," "comprising," "includes," "including," "has," "having" or other similar referents are to be construed to cover a non-exclusive inclusion. For example, inclusion of a feature (e.g., a starting material, component, ingredient, carrier, dosage form, material, size, part, component, mechanism, apparatus, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article of manufacture, etc.) should be construed as including not only the feature explicitly recited, but also other features known in the art that are not explicitly recited.
The term "consisting of" means excluding any technical feature elements not explicitly listed. If such term is used in a claim, the term will cause the claim to be closed, such that it does not include technical features other than those specifically listed, except for conventional impurities associated therewith. If the term is intended to appear in only a clause of a claim, it is intended to limit only the elements explicitly recited in that clause, and the elements recited in other clauses are not excluded from the overall claim.
Unless otherwise specified or limited, the terms "mounted," "connected," "secured" and the like should be construed broadly: for example, a connection may be fixed, removable, or integral; mechanical or electrical; direct or indirect via an intervening medium; or an internal communication between two elements. The specific meaning of these terms herein will be understood by those of ordinary skill in the art as the case may be.
The terms "center," "longitudinal," "transverse," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," etc. refer to an orientation or positional relationship based on that shown in the drawings, merely for ease of description and to simplify the description, and do not explicitly or implicitly indicate that the apparatus or element in question must have a particular orientation, be constructed and operated in a particular orientation, and therefore should not be construed as limiting the present disclosure.
The scheme provided by the invention is described in detail below. Matters not described in detail in the embodiments of the present invention belong to the prior art known to those skilled in the art. Where specific conditions are not noted in the examples of the present invention, they are carried out according to conditions conventional in the art or suggested by the manufacturer. Reagents or apparatus used in the examples for which no manufacturer is indicated are conventional products that are commercially available.
As shown in fig. 1, an embodiment of the present invention provides a method for improving control accuracy of an offline reinforcement learning robot, including:
Step 1, acquiring an offline data set containing damaged data;
Step 2, mapping the offline data set into an action value function, estimating uncertainty values of posterior distribution of the action value function by a variational Bayesian inference method, and adjusting weights used by damaged data based on the uncertainty values to train an offline reinforcement learning model of the control robot until the accumulated value of rewards is maximized;
Step 3, deploying the offline reinforcement learning model trained in step 2 on a robot, and controlling the robot to complete a preset operation task.
Preferably, in the method, the offline data set containing damaged data acquired in step 1 includes a damaged action data subset composed of damaged action data, a damaged state data subset composed of damaged state data, and a damaged reward data subset composed of damaged reward data.
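For illustration only, the following is a minimal sketch of how such an offline data set of transitions might be held in code; the class name, field names, and array shapes are assumptions and are not part of the claimed method. Damaged samples are mixed with clean ones and carry no labels, since the method later identifies them through uncertainty.

```python
import numpy as np

class OfflineDataset:
    """Offline RL data set of transitions; some states, actions, or rewards may be damaged,
    but damaged samples are not labeled (assumed layout, for illustration only)."""
    def __init__(self, states, actions, rewards, next_states):
        self.states = np.asarray(states, dtype=np.float32)            # (N, state_dim)
        self.actions = np.asarray(actions, dtype=np.float32)          # (N, action_dim)
        self.rewards = np.asarray(rewards, dtype=np.float32)          # (N,)
        self.next_states = np.asarray(next_states, dtype=np.float32)  # (N, state_dim)

    def sample(self, batch_size, rng=None):
        rng = rng or np.random.default_rng()
        idx = rng.integers(0, len(self.rewards), size=batch_size)
        return (self.states[idx], self.actions[idx],
                self.rewards[idx], self.next_states[idx])
```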
Preferably, in step 2 of the method, mapping the offline data set into an action value function, estimating an uncertainty value of the posterior distribution of the action value function by a variational Bayesian inference method, adjusting the usage weights of the damaged data based on the uncertainty value, and training the offline reinforcement learning model for controlling the robot until the cumulative reward is maximized, includes:
Step 21, inputting the three kinds of damaged data in the damaged offline data set into a critic network and a value network of the offline reinforcement learning model to generate an action value function and a value function; the three kinds of damaged data are the damaged state data, the damaged action data, and the damaged reward data;
Step 22, inputting the generated action value function, the value function, and the three kinds of damaged data into an integrated model formed by three multi-layer perceptrons in parallel to obtain three kinds of reconstructed data, namely reconstructed state data, reconstructed action data, and reconstructed reward data; estimating the posterior distribution of the action value function from the three kinds of reconstructed data, and calculating the uncertainty value of the posterior distribution of the action value function;
Step 23, adjusting the usage weight of the damaged data according to the calculated uncertainty value of the posterior distribution of the action value function, and updating the critic network and the integrated model with the three kinds of reconstructed data according to the determined usage weight of the damaged data;
Step 24, inputting the reconstructed state data into the updated critic network to generate a new action value function;
Step 25, updating the actor network with the new action value function;
Step 26, repeating steps 21 to 25 to train the critic network, the value network, and the actor network of the offline reinforcement learning model until the output value of the value network is maximized, completing the training of the offline reinforcement learning model (a schematic code sketch of this training procedure is given below).
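To make the data flow of steps 21 to 26 concrete, a schematic PyTorch-style sketch of one training iteration follows. It is only a structural sketch under assumptions: the network sizes, the mean-squared placeholder losses, the inverse-proportional weighting form, and the deterministic actor are not taken from the patent, whose actual objectives are given by equations (6) to (12) below.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

state_dim, action_dim, n_quantiles = 17, 6, 32       # HalfCheetah-like sizes (assumed)

critic = mlp(state_dim + action_dim, n_quantiles)    # quantile representation of the action value function
value_net = mlp(state_dim, 1)                        # value network
actor = mlp(state_dim, action_dim)                   # actor network (deterministic head, assumed)
feat_dim = state_dim + action_dim + n_quantiles
recon_state = mlp(feat_dim, state_dim)               # integrated model: three MLPs in parallel,
recon_action = mlp(feat_dim, action_dim)             # one per damaged element
recon_reward = mlp(feat_dim, 1)

opt_main = torch.optim.Adam(
    list(critic.parameters()) + list(value_net.parameters())
    + list(recon_state.parameters()) + list(recon_action.parameters())
    + list(recon_reward.parameters()), lr=3e-4)
opt_actor = torch.optim.Adam(actor.parameters(), lr=3e-4)

def train_step(s, a, r, s_next):
    """One schematic iteration over a (possibly damaged) mini-batch; r has shape (B,)."""
    # Step 21: critic applied to the damaged data.
    q = critic(torch.cat([s, a], dim=-1))             # (B, n_quantiles)

    # Step 22: reconstruct each element from the data plus the quantile representation.
    feat = torch.cat([s, a, q], dim=-1)
    s_hat, a_hat, r_hat = recon_state(feat), recon_action(feat), recon_reward(feat)
    recon_loss = ((s_hat - s) ** 2).mean() + ((a_hat - a) ** 2).mean() \
                 + ((r_hat - r.unsqueeze(-1)) ** 2).mean()   # placeholder for eqs. (9)-(10)

    # Step 23: down-weight samples whose action-value posterior is highly uncertain.
    with torch.no_grad():
        weight = 1.0 / (1.0 + q.std(dim=-1, keepdim=True))   # assumed inverse-proportional form
    td_target = r.unsqueeze(-1) + 0.99 * value_net(s_next)
    critic_loss = (weight * (q.mean(dim=-1, keepdim=True) - td_target).pow(2)).mean()  # placeholder TD loss

    main_loss = recon_loss + critic_loss
    opt_main.zero_grad(); main_loss.backward(); opt_main.step()

    # Steps 24-25: actor update on the reconstructed states with the updated critic.
    with torch.no_grad():
        q_new = critic(torch.cat([s, a], dim=-1))
        s_rec = recon_state(torch.cat([s, a, q_new], dim=-1))
    actor_loss = -critic(torch.cat([s_rec, torch.tanh(actor(s_rec))], dim=-1)).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    # Stale critic gradients from the actor loss are cleared by opt_main.zero_grad() next call.
    return main_loss.item(), actor_loss.item()
```

In a full implementation, the placeholder losses above would be replaced by the quantile-based reconstruction and maximum-likelihood losses of equations (9)-(11), and the uncertainty by equation (12).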
Preferably, in step 22 of the above method, inputting the generated action value function, the value function, and the three kinds of damaged data into the integrated model formed by three multi-layer perceptrons in parallel to obtain the three kinds of reconstructed data includes:
The likelihood function of the damaged action data is modeled with a multi-layer perceptron, and the posterior distribution of the maximized cumulative reward is derived from the likelihood function of the damaged action data by variational Bayesian inference, so as to obtain the loss function of the posterior distribution of the maximized cumulative reward corresponding to the damaged action data; that is, the action data subject to the damaged action distribution are used as observation data to estimate the loss function of the posterior distribution of the cumulative reward, as given by equation (6):
(6);
In the above equation (6), the quantities involved are: the loss function for optimizing the action value function based on the reconstructed action data; the parameters of the critic network that models the action value function; the multi-layer perceptron used to reconstruct the damaged action data; the expectation taken over the offline data set; the action value distribution; the KL divergence; the damaged action data distribution, which is modeled by the multi-layer perceptron from the damaged action data; the reconstructed action data, which follow the damaged action data distribution; the new action value function, which follows the action value distribution and is represented by its parameters; the damaged state data, which follow the damaged state distribution in the offline data set; the damaged reward data, which follow the damaged reward distribution in the offline data set; the damaged next-state data, which follow the damaged state transition distribution in the offline data set given the damaged state data; and the damaged action distribution followed by the damaged action data of the offline data set.
The likelihood function of the damaged reward data is modeled with a multi-layer perceptron, and the posterior distribution of the maximized cumulative reward is derived from the likelihood function of the damaged reward data by variational Bayesian inference, so as to obtain the loss function of the posterior distribution of the maximized cumulative reward corresponding to the damaged reward data; that is, the reward data subject to the damaged reward distribution are used as observation data to estimate the loss function of the posterior distribution of the cumulative reward, as given by equation (7):
(7);
In the above equation (7), the quantities involved are: the loss function for optimizing the action value function based on the reconstructed reward data; the parameters of the critic network that models the action value function; the multi-layer perceptron used to reconstruct the damaged reward data; the expectation taken over the offline data set; the action value distribution; the KL divergence; the damaged reward data distribution, which is modeled by the multi-layer perceptron from the damaged reward data; the damaged reward distribution in the offline data set; the reconstructed reward data, which follow the damaged reward data distribution; the new action value function, which follows the action value distribution and is represented by its parameters; the damaged state data, which follow the damaged state distribution in the offline data set; the damaged action data, which follow the damaged action distribution in the offline data set; and the damaged next-state data, which follow the damaged state transition distribution in the offline data set given the damaged state data.
The likelihood function of the damaged state data is modeled with a multi-layer perceptron, and the posterior distribution of the maximized cumulative reward is derived from the likelihood function of the damaged state data by variational Bayesian inference, so as to obtain the loss function of the posterior distribution of the maximized cumulative reward corresponding to the damaged state data; that is, the state data subject to the damaged state distribution are used as observation data to estimate the loss function of the posterior distribution of the cumulative reward, as given by equation (8):
(8);
In the above equation (8), the quantities involved are: the loss function for optimizing the action value function based on the reconstructed state data; the parameters of the critic network that models the action value function; the multi-layer perceptron used to reconstruct the damaged state data; the expectation taken over the offline data set; the action value distribution; the KL divergence; the damaged state data distribution, which is modeled by the multi-layer perceptron from the damaged state data; the reconstructed state data, which follow the damaged state data distribution; the new action value function, which follows the action value distribution and is represented by its parameters; the damaged action data, which follow the damaged action distribution in the offline data set; and the damaged reward data, which follow the damaged reward distribution in the offline data set.
Preferably, in step 22 of the above method, the first-term KL divergence in equation (6) for the posterior distribution of the maximized cumulative reward corresponding to the damaged action data, the first-term KL divergence in equation (7) for the posterior distribution of the maximized cumulative reward corresponding to the damaged reward data, and the first-term KL divergence in equation (8) for the posterior distribution of the maximized cumulative reward corresponding to the damaged state data are each replaced using generalized Bayesian inference to obtain the damaged-data reconstruction loss; this first-term damaged-data reconstruction loss is calculated as in equation (9):
(9);
In the above equation (9), the quantities involved are: the expectation taken over the offline data set; the damaged state data sample, the damaged action data sample, and the damaged reward data sample from the offline data set, with the new action value function following the action value distribution; the means and standard deviations of the corresponding damaged reward data distribution, damaged state data distribution, and damaged action data distribution; and the superscript T, which denotes the transpose of a matrix.
Generalized Bayesian inference is used to maximize the log probability of the second term in equation (6) for the posterior distribution of the maximized cumulative reward corresponding to the damaged action data, of the second term in equation (7) for the damaged reward data, and of the second term in equation (8) for the damaged state data, in each case as the maximum likelihood loss; this second-term maximum likelihood loss is calculated by the following equation (10):
(10);
In the above equation (10), the quantities involved are: the expectation taken over the offline data set; the damaged state data sample, the damaged action data sample, and the damaged reward data sample from the offline data set; samples drawn, respectively, from the damaged state data distribution, the damaged action data distribution, and the damaged reward data distribution, which are reconstructed from the damaged state, action, and reward data by the corresponding multi-layer perceptrons; the Huber function with a given threshold; a cumulative probability sampled uniformly at random from the interval [0, 1]; two quantile values of the new action value function, both given by its quantile function at the sampled cumulative probability, where one carries gradient information and is used to update the parameters of the action value function while the other carries no gradient information, is only a constant, and is not used to update the parameters of the action value function; and the action value distribution.
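As an illustration of how one multi-layer perceptron of the integrated model can model the distribution of a damaged element through a mean and a standard deviation (the quantities appearing in equation (9)) and feed a robust Huber-type penalty (as in equation (10)), a minimal sketch follows. The Gaussian parameterization, the network sizes, and the exact form of the loss are assumptions; the patent's formulas are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianReconstructor(nn.Module):
    """MLP that maps the remaining elements (and the quantile representation of the action
    value function) to a mean and standard deviation for one damaged element (assumed form)."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, out_dim)
        self.log_std_head = nn.Linear(hidden, out_dim)

    def forward(self, x):
        h = self.body(x)
        mean = self.mean_head(h)
        std = self.log_std_head(h).clamp(-5.0, 2.0).exp()
        return mean, std

def robust_recon_loss(mean, std, target, delta=1.0):
    # Standardized residual between the observed (possibly damaged) element and the
    # reconstructed distribution, passed through a Huber penalty so that heavy-tailed
    # damaged samples do not dominate (a stand-in for eqs. (9)-(10), not the exact loss).
    z = (target - mean) / std
    return F.huber_loss(z, torch.zeros_like(z), delta=delta) + std.log().mean()
```

For the action reconstructor, for example, the input would be the concatenation of the damaged state, reward, and next-state data with the quantile representation of the action value function.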
Preferably, in step 22 of the above method, estimating the posterior distribution of the action value function from the three kinds of reconstructed data includes:
Adding the derived damaged-data reconstruction loss and the maximum likelihood loss gives equation (11), the integrated loss function that estimates the posterior distribution of the action value function from all of the damaged data; the parameters of the posterior distribution of the action value function are updated with this integrated loss function, and the posterior distribution of the action value function is then updated based on these parameters. Equation (11) is:
(11);
In the above equation (11), the quantities involved are: the parameters of the critic network that models the action value function; the multi-layer perceptron that reconstructs the damaged state data, the multi-layer perceptron that reconstructs the damaged action data, and the multi-layer perceptron that reconstructs the damaged reward data among the three kinds of damaged data; the damaged-data reconstruction loss; and the maximum likelihood loss.
Preferably, in step 22 of the above method, after the posterior distribution of the action value function has been updated, the uncertainty value of the posterior distribution of the action value function is calculated by equation (12):
(12);
In the above equation (12), the quantities involved are: N, the total number of samples, fixed to 32 in practice; the quantile values of the new action value function given by its quantile function at the sampled cumulative probabilities indexed by n and n−1, where, when the subscript n is 1, the quantity indexed by n−1 is taken equal to the quantity indexed by 1; and the damaged state data sample, the damaged action data sample, and the damaged reward data sample from the offline data set.
Preferably, in step 23 of the above method, the usage weight of the damaged data is adjusted such that it is inversely proportional to the calculated uncertainty value of the posterior distribution of the action value function.
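A minimal sketch of the uncertainty estimate and the inverse-proportional usage weight described above follows; the precise form of equation (12) is not reproduced, the critic interface is assumed to accept sampled cumulative probabilities, and the specific weighting function is one simple choice rather than the patent's.

```python
import torch

def quantile_uncertainty(critic, s, a, n_samples=32):
    """Uncertainty of the action-value posterior per sample, estimated from the spread
    between consecutive sampled quantile values (N = 32 samples, as stated above)."""
    with torch.no_grad():
        taus, _ = torch.rand(n_samples).sort()     # sorted cumulative probabilities in [0, 1]
        q = critic(s, a, taus)                     # (B, n_samples) quantile values (assumed interface)
        gaps = (q[:, 1:] - q[:, :-1]).abs()        # difference between quantiles n and n-1
        return gaps.mean(dim=-1)                   # (B,) uncertainty value

def usage_weight(uncertainty):
    # Usage weight inversely proportional to the uncertainty value (one simple choice).
    return 1.0 / (1.0 + uncertainty)
```

These weights can then scale the per-sample losses so that samples whose posterior is highly uncertain, and therefore likely damaged, contribute less to the updates of the critic network and the integrated model.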
The embodiment of the invention also provides a processing device, which comprises:
At least one memory for storing one or more programs;
at least one processor capable of executing one or more programs stored in the memory, which when executed by the processor, enable the processor to implement the methods described above.
The embodiments of the present invention further provide a readable storage medium storing a computer program which, when executed by a processor, is capable of implementing the method of the present invention.
In summary, in the method of the embodiment of the invention, the offline reinforcement learning model for controlling the robot is trained on the offline data set by a robust variational Bayesian inference method until the cumulative reward is maximized. The Bayesian inference framework captures the uncertainty caused by diverse data damage in the offline data set: all damaged data are modeled as uncertainty of the action value function, the posterior distribution of the action value function is approximated from the offline data, and damaged and undamaged data are distinguished by an entropy-based uncertainty measure, so that the influence of the damaged data is adjusted during policy learning. This reduces the negative effect of the damaged data on the policy, significantly improves the robustness and performance of the model in a clean environment, and improves the accuracy of the robot controlled by the offline reinforcement learning model. The method overcomes the limitation of existing offline reinforcement learning algorithms in the complex situation where multiple elements of the data set are damaged at the same time, thereby enhancing the effectiveness of offline reinforcement learning in practical applications.
In order to more clearly demonstrate the technical scheme provided by the invention and the technical effects produced by the technical scheme, the scheme provided by the embodiment of the invention is described in detail below by using specific embodiments.
Example 1
As shown in fig. 1 and fig. 4, an embodiment of the present invention provides a method for improving the control accuracy of an offline reinforcement learning robot. The method captures the uncertainty caused by diverse data damage in an offline data set through a Bayesian inference framework, distinguishes damaged data from undamaged data according to an uncertainty metric, and adjusts the influence of the damaged data during policy learning, thereby reducing the negative effect of the damaged data on the policy, significantly improving the robustness and performance of the offline reinforcement learning model in a clean environment, and further improving the robustness and performance of the controlled robot. The method includes the following steps:
Step 1, acquiring an offline data set containing damaged data;
Step 2, mapping the offline data set into an action value function, estimating uncertainty values of posterior distribution of the action value function by a variational Bayesian inference method, and adjusting weights used by damaged data based on the uncertainty values to train an offline reinforcement learning model of the control robot until the accumulated value of rewards is maximized;
Step 3, deploying the offline reinforcement learning model trained in step 2 on a robot, and controlling the robot to complete a preset operation task.
In step 2, mapping the offline data set into an action value function, estimating the uncertainty value of the posterior distribution of the action value function by a variational Bayesian inference method, adjusting the usage weights of the damaged data based on the uncertainty value, and training the offline reinforcement learning model for controlling the robot until the cumulative reward is maximized, includes the following steps:
Step 21, inputting the three kinds of damaged data in the damaged offline data set into a critic network and a value network of the offline reinforcement learning model to generate an action value function and a value function; the three kinds of damaged data are the damaged state data, the damaged action data, and the damaged reward data;
Step 22, inputting the generated action value function, the value function, and the three kinds of damaged data into an integrated model formed by three multi-layer perceptrons in parallel to obtain three kinds of reconstructed data, namely reconstructed state data, reconstructed action data, and reconstructed reward data; estimating the posterior distribution of the action value function from the three kinds of reconstructed data, and calculating the uncertainty value of the posterior distribution of the action value function;
Step 23, adjusting the usage weight of the damaged data according to the calculated uncertainty value of the posterior distribution of the action value function, and updating the critic network and the integrated model with the three kinds of reconstructed data according to the determined usage weight of the damaged data;
Step 24, inputting the reconstructed state data into the updated critic network to generate a new action value function;
Step 25, updating the actor network with the new action value function;
Step 26, repeating steps 21 to 25 to train the critic network, the value network, and the actor network of the offline reinforcement learning model until the output value of the value network is maximized, completing the training of the offline reinforcement learning model. The specific analysis process of the method of the invention is as follows:
(I) Problem analysis:
The problem of training the offline reinforcement learning algorithm that controls a robot on an offline data set containing damaged data can be regarded as a general Markov decision process (MDP), represented by a tuple consisting of the state space, the action space, the reward space, the transition probability distribution of the next state conditioned on a state-action pair, the probability distribution of the initial state, and the discount factor. The sets of probability distributions over subsets of the state space and of the action space are also used below.
For simplicity, random variables are denoted below by uppercase letters and their values by lowercase letters. In particular, an uppercase letter denotes the random variable of the single-step reward, which follows the reward distribution, and the corresponding lowercase letter denotes a value of this random variable.
It is assumed that, for any state-action pair, the random variable of the single-step reward and its expectation are both well defined by that state-action pair.
The goal of the present invention is to learn an optimal policy that maximizes the discounted cumulative reward:
(1);
Based on the cumulative reward, the value function is defined as the expectation of the action value function (i.e., the expected cumulative reward), and the action value function (i.e., the random variable of the cumulative reward) is given by:
(2);
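For reference, a minimal sketch of the discounted cumulative reward that the policy of formula (1) is meant to maximize; the discount factor value used here is an assumption.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative reward of one trajectory: r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: a short trajectory of single-step rewards.
print(discounted_return([1.0, 1.0, 1.0]))   # 1 + 0.99 + 0.99**2 = 2.9701
```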
In the real world, data collected by sensors or humans may be subject to various forms of damage due to sensor failures or malicious attacks. The clean offline data set and the damaged offline data set are therefore distinguished, each comprising a number of samples. The states of the clean offline data set follow one state distribution, while the states of the damaged offline data set follow a damaged state distribution; the actions of the clean offline data set follow the behavior policy, while the actions of the damaged offline data set are sampled from a damaged action distribution; the rewards and next states of the damaged offline data set are likewise drawn from their respective damaged distributions. The empirical state-action distributions of the clean and damaged offline data sets are also distinguished accordingly. Then, for any state-action pair, the Bellman equation under the introduced damaged data is:
(3);
Here, the equality denotes equality in distribution; that is, the random variables on the two sides follow the same distribution law.
The method of the invention involves fitting the distribution of the action value function, and therefore introduces the distributional reinforcement learning paradigm, which transforms the task of fitting the distribution of the action value function into learning the quantile function (also known as the inverse cumulative distribution function) of that distribution. The action value function (i.e., the random variable of the cumulative reward) is parameterized, and its quantile-function representation is evaluated at a vector of cumulative probabilities, each component of which is sampled from the uniform distribution on [0, 1]. Each resulting quantile value is a cumulative-reward value whose corresponding cumulative probability is the sampled probability. Based on the temporal-difference learning paradigm and quantile regression, the recursive optimization objective of this representation is:
(4);
Here, the target involves the quantile-function representation of the value function and a sum over a vector of cumulative probabilities sampled from the uniform distribution on [0, 1], and the loss uses the Huber function with a given threshold, which is defined as:
(5);
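Formulas (4) and (5) follow the standard quantile-regression temporal-difference objective with a Huber penalty used in distributional reinforcement learning. Since the formula images are not reproduced above, the sketch below shows the standard form of that loss as an illustration; the threshold value and tensor shapes are assumptions.

```python
import torch

def quantile_huber_loss(pred_quantiles, target_quantiles, taus, kappa=1.0):
    """Standard quantile-regression Huber loss.
    pred_quantiles:   (B, N) quantile values of the current action-value distribution
    target_quantiles: (B, M) quantile values of the TD target (no gradient)
    taus:             (N,)   cumulative probabilities at which pred_quantiles are taken
    """
    # Pairwise TD errors between every target quantile and every predicted quantile.
    td = target_quantiles.detach().unsqueeze(1) - pred_quantiles.unsqueeze(2)   # (B, N, M)
    abs_td = td.abs()
    huber = torch.where(abs_td <= kappa,
                        0.5 * td.pow(2),
                        kappa * (abs_td - 0.5 * kappa))
    # Asymmetric weighting by |tau - 1{td < 0}| makes each output fit its quantile level.
    weight = (taus.view(1, -1, 1) - (td.detach() < 0).float()).abs()
    return (weight * huber / kappa).sum(dim=1).mean()
```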
(II) Bayesian inference based on damaged data:
Considering (1) that multiple types of damage introduce high uncertainty into all elements of the data set, and (2) that there is a clear correlation (see the dashed lines in FIG. 2) between each element and the cumulative reward (i.e., the action value, or Q value), estimating the cumulative-reward function (i.e., the action value function) from multiple types of damaged data introduces high uncertainty.
To deal with the high uncertainty caused by the various kinds of damaged data (i.e., damage to state, action, reward, and state transition data), it is proposed, based on the probabilistic graphical model shown in fig. 2, to use all of the damaged data in the offline data set as observations and to exploit the high correlation between these observations and the cumulative reward to accurately identify the uncertainty of the action value function. FIG. 2 shows the probabilistic graphical model of the decision process, in which the nodes connected by solid lines represent data in the data set, while the Q values (i.e., action values, cumulative rewards) connected by dashed lines do not belong to the offline data set. These Q values are the quantities that the method of the present invention aims to estimate.
Based on this idea, the present invention proposes robust variational Bayesian inference for reinforcement learning (called TRACER) to learn an uncertainty-aware cumulative-reward representation that captures the uncertainty in the offline data set and copes with diverse data corruption. In particular, all damaged data are modeled as uncertainty of the cumulative-reward representation; to capture this uncertainty, variational Bayesian inference is introduced, using all of the offline damaged data as observations to estimate the posterior distribution of the cumulative reward.
The action value function is redefined based on FIG. 2: in general the action value function is a mean function, while here the distribution of the action value function is modeled. The derivation starts from the action data that follow the damaged action distribution, using them as observations to estimate the posterior distribution of the cumulative reward. Because the action data are correlated with the cumulative reward and with all other elements (states, rewards, state transitions), the likelihood function of the damaged action data can be represented with a multi-layer perceptron. Then, based on the variational Bayesian inference framework, the cumulative reward is maximized to obtain the evidence lower bound (ELBO), equation (6):
(6);
Here, the state, reward, and next-state samples follow the damaged state distribution, the damaged reward distribution, and the damaged state transition distribution, respectively, and the divergence term is the Kullback-Leibler (KL) divergence.
Then, analogously to equation (6), the reward data that follow the damaged reward distribution are used as observation data, and the likelihood function of the damaged reward data is parameterized by a multi-layer perceptron, yielding equation (7):
(7);
Here, the state and action samples follow the damaged state distribution and the damaged action distribution, respectively.
Finally, the state data that follow the damaged state distribution are used as observation data, and the likelihood function of the damaged state data is parameterized by a multi-layer perceptron, yielding equation (8):
(8);
Here, the action and reward samples follow the damaged action distribution and the damaged reward distribution, respectively.
Because the distributions involved are not available in closed form, the first term of equations (6), (7), and (8), i.e., the KL divergence term, cannot be calculated directly. Generalized Bayesian inference is therefore introduced to replace the two distributions appearing in the KL divergence, so that an optimizable objective for the first term, namely the damaged-data reconstruction loss, is obtained approximately as follows:
(9);
Here, the means and standard deviations are those of the corresponding reconstructed distributions, and the subscripts a, r, and s of the parameters correspond to the damaged action data, the damaged reward data, and the damaged state data, respectively.
In addition, for the second term of equations (6), (7), and (8), considering that damaged data can introduce heavy-tailed sample points, a robust Huber function based on generalized Bayesian inference is applied to maximize the log probability, giving the objective for the second term, namely the maximum likelihood loss:
(10);
Here, the reconstructed samples follow, respectively, the reconstructed distributions produced by the multi-layer perceptrons, and the corresponding state, action, and reward samples are contained in the offline data set.
Adding equation (9) and equation (10) above gives equation (11), the integrated loss function for estimating the posterior distribution of the action value function from all of the damaged data; the parameters of the posterior distribution of the action value function are updated with this integrated loss function, and the posterior distribution of the action value function is then updated based on these parameters. Equation (11) is:
(11);
After the updated posterior distribution of the action value function has been obtained, the uncertainty value of the posterior distribution of the action value function can be further estimated by the following equation (12):
(12);
Here, N is the total number of samples, fixed to 32 in practice; the quantile values are values of the quantile function of the action value function at the sampled cumulative probabilities, where, when n is 1, the quantity indexed by n−1 is taken equal to the quantity indexed by 1; and the damaged state data sample, the damaged action data sample, and the damaged reward data sample come from the damaged offline data set.
Compared with the existing offline reinforcement learning algorithm, the method has the following advantages:
(1) The uncertainty caused by data damage is captured and quantified using a Bayesian framework, so that the uncertainty in the offline data can be handled more accurately and the machine is more robust in the face of noisy data;
(2) The uncertainty measurement based on entropy is introduced, so that damaged data and undamaged data can be distinguished, a loss function related to the damaged data is adjusted, and the influence of the damaged data on robustness is reduced;
(3) The method is applied to the field of robot control, and the robustness of the offline reinforcement learning algorithm in the face of various data damages is effectively enhanced.
To verify the effectiveness of the method of the present embodiment, the method of the present embodiment is applied to different simulation environments and data sets for testing. These tests are intended to evaluate the robustness and performance of the inventive method in a variety of scenarios. Considering the high cost and inefficiency of training and testing reinforcement learning algorithms in real environments, experiments were chosen in MuJoCo and CARLA simulation environments and the data set provided by D4RL (Datasets for Deep Data-Driven Reinforcement Learning) was employed as the offline data set.
(I) MuJoCo simulation environment:
MuJoCo, short for Multi-Joint Dynamics with Contact, was developed mainly by Professor Emo Todorov of the University of Washington and is applied in fields such as optimal control, state estimation, and system identification; it has clear advantages in applications involving dynamic multi-point contact of robots (such as dexterous multi-finger hand manipulation). Unlike other engines that use robot model formats such as URDF or SDF, the MuJoCo team developed its own robot modeling format (MJCF) to support richer environment parameter configuration.
The algorithm of the present invention was implemented and tested using mainly the 3 robot control environments provided by MuJoCo in this example. In these task environments, the observable states are different physical quantities (e.g., position, angle, speed, etc.) of various parts (e.g., legs, joints, etc.) of the simulation robot, and the controllable actions are the magnitudes of forces used by specific parts (e.g., legs, head). Specifically, the 3 simulation robot control environments are respectively:
(11) HalfCheetah: a two-dimensional bipedal robot environment in which the task is to control the robot to run forward as fast as possible. The observation space includes physical quantities such as the positions and velocities of the robot's parts, and the action space involves the forces applied to the leg joints. The observation space is 17-dimensional and the action space is 6-dimensional.
(12) Walker2d: in this environment, the task is to control a two-dimensional humanoid robot to walk steadily. The observation space contains the physical states of the robot's parts (such as legs and joints), and the action space involves the forces applied to those parts. The observation space is 17-dimensional and the action space is 6-dimensional.
(13) Hopper: a single-legged hopping robot environment in which the task is to control the robot to hop forward on one foot. Since the robot has only one leg, maintaining balance is the main challenge of this task. The observation space is 11-dimensional and the action space is 3-dimensional.
(II) CARLA simulation environment:
CARLA is an Unreal Engine-based open-source simulation platform mainly used for autonomous-driving research. CARLA environments simulate urban driving scenarios, including complex dynamic elements such as traffic lights, pedestrians, and vehicles. The performance of the TRACER algorithm was tested mainly in the CARLA-Lane scenario, whose task is lane keeping on a figure-eight path while avoiding collisions with other vehicles. The observation space is 48×48×3-dimensional, i.e., the observations are 48×48 RGB images, and the action space is 3-dimensional.
(21) Experiment setting:
Experiments were performed with random or adversarial corruption applied, with a given probability, to all four classes of elements of the offline data set: states, actions, rewards, and state transitions. Random corruption adds random noise to the data (i.e., noise sampled from a uniform distribution), while adversarial corruption uses a pre-trained value function to generate attack noise along the gradient direction of the current data. It should be noted that, unlike the other types of adversarial corruption, adversarial reward corruption is implemented by multiplying the original reward data by a factor rather than by using a gradient. The core of the present invention focuses on random or adversarial mixed corruption, meaning that random or adversarial corruption is present in all four elements of the data set at the same time. In the experiments of the invention, a corruption rate c is set, i.e., c% of each element in the offline data set is randomly selected for corruption; in the case of mixed corruption, a total of 1 − (1 − c)^4 of the data in the data set is corrupted. In addition, a corruption scale is specified: for random corruption the noise follows a uniform distribution over an interval whose half-width is the corruption scale, and for adversarial corruption the gradient-based attack noise is constrained to lie within the same interval. In practice the corruption rate is set to 30% and the corruption scale to 1.0; note that with the corruption rate set to 30%, 76% of the total data is noisy under mixed corruption.
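For concreteness, a sketch of how random mixed corruption of this kind can be applied to an offline data set; the dictionary layout is an assumption, and the adversarial (gradient-based) corruption is omitted.

```python
import numpy as np

def random_mixed_corruption(dataset, rate=0.3, scale=1.0, seed=0):
    """Independently corrupt each of the four elements (states, actions, rewards, next_states)
    of a fraction `rate` of the samples with uniform noise in [-scale, scale].
    With rate = 0.3, roughly 1 - 0.7**4 ~ 76% of transitions have at least one corrupted element."""
    rng = np.random.default_rng(seed)
    out = {k: np.array(v, dtype=np.float32, copy=True) for k, v in dataset.items()}
    n = len(out["rewards"])
    for key in ("states", "actions", "rewards", "next_states"):
        idx = rng.random(n) < rate                       # which samples to corrupt for this element
        noise = rng.uniform(-scale, scale, size=out[key][idx].shape)
        out[key][idx] = out[key][idx] + noise
    return out
```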
(III) Experimental results:
The "medium-replay-v2" data set provided by D4RL was chosen to verify the robustness of the different algorithms against the data corruption problem in offline reinforcement learning. This data set is collected in a manner very close to real-world applications, by a robot trained with SAC (Soft Actor-Critic). Random or adversarial corruption is applied to the offline data set to construct the kinds of data damage that may be encountered in the real world. Robots are trained with the different algorithms on these damaged data, and the trained robots are deployed in a clean test environment to study the robustness of the different algorithms to the damaged data.
For each MuJoCo environment, TRACER is compared with the previous state-of-the-art methods:
(31) BC, an algorithm based on imitation learning;
(32) CQL, an offline reinforcement learning algorithm using a double Q network;
(33) IQL, an algorithm based on implicit Q-learning;
(34) EDAC, an uncertainty-based offline reinforcement learning algorithm realized with an ensemble of Q networks (ensemble size greater than 2);
(35) MSG, an uncertainty-based method that combines diversity with robustness;
(36) UWMSG, a recently proposed offline reinforcement learning algorithm combining uncertainty and robustness;
(37) RIQL, a recent robust offline reinforcement learning algorithm for offline data sets with diverse corruption.
Table 1 shows the generalization performance of the method of the present invention (i.e., TRACER) in a clean, damage-free test environment when multiple data elements are damaged simultaneously. The method achieves a clear performance improvement in all experimental settings, with the improvement reaching +21.1%, showing that the method is strongly robust to large-scale damaged data.
Table 1 reports the average score and standard error after 3 million training steps under random mixed corruption (random) and adversarial mixed corruption (advers) of the data, with the highest average performance marked in bold.
Tables 2 and 3 show the average performance of the method TRACER in a clean environment when a single type of data element is corrupted. In both tables the corruption is small-scale, affecting a smaller proportion of the data set than the mixed corruption. In this setting, TRACER improves generalization performance in 14 of the 24 experimental settings and obtains the highest average performance score, showing that TRACER is also effective for small-scale, single-element corruption.
Table 2 shows the average score and standard error after 3 million training steps of the agent under random corruption of one element of the data. The highest average performance is marked in bold.
Table 3 shows the average score and standard error after 3 million training steps of the agent under adversarial corruption of one element of the data. The highest average performance is marked in bold.
In addition, experiments were performed in the autonomous-driving simulator CARLA, where the "Lane-v0" task was selected and the data were subjected to random mixed corruption using the parameters described above, c = 30 and a corruption scale of 1.0. The experimental results, i.e. the performance scores of the inventive method under random mixed corruption in the Lane-v0 task of the CARLA simulator, are shown in fig. 3; on this task the inventive method achieves the best generalization performance in a clean environment.
Example 2
This embodiment provides a method for improving the control accuracy of an offline reinforcement learning robot, which comprises the following steps:
Preparation stage:
Random or adversarial (attack) noise is added to the offline dataset to generate a batch of corrupted data. The robot learns a strategy based on the corrupted data and maximizes the accumulated value of rewards, which is the object of the invention.
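By way of illustration only, the following sketch shows one way the gradient-based (adversarial) noise described above could be generated for state elements from a pre-trained value function. The callable q_fn and the single-step sign update are assumptions and not the exact attack used by the invention.

```python
import torch

def adversarial_state_corruption(states, actions, q_fn, scale=1.0, steps=1):
    """Illustrative sketch: perturb states in the direction that lowers a
    pre-trained action value function q_fn(s, a) (assumed callable), keeping
    the attack noise within the corruption scale."""
    s = states.clone().requires_grad_(True)
    for _ in range(steps):
        q = q_fn(s, actions).sum()
        (grad,) = torch.autograd.grad(q, s)
        with torch.no_grad():
            s = s - scale * grad.sign()        # move against the value gradient
        s.requires_grad_(True)
    # clamp the total perturbation so it stays within the corruption scale
    noise = torch.clamp(s.detach() - states, -scale, scale)
    return states + noise
```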
Training phase: TRACER is implemented based on the framework of fig. 3. The specific training process is as follows:
1) Input the (corrupted) offline data into the critic network and the value network to generate an action value function and a value function;
2) Input the (corrupted) offline data into an integrated model consisting of 3 multi-layer perceptrons to reconstruct the offline data;
3) Input the state into the actor network to generate the currently learned strategy;
4) Update the loss function of each network and repeat the training many times (see the sketch after this list).
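By way of illustration only, the following Python sketch organizes one training iteration according to steps 1) to 4). The network sizes, the simple TD/MSE surrogate losses and the optimizer settings are assumptions standing in for the losses of equations (6)-(11), which are not reproduced in this document.

```python
import torch
import torch.nn as nn

def mlp(inp, out, hidden=256):
    return nn.Sequential(nn.Linear(inp, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out))

# Illustrative dimensions for a MuJoCo-style task (assumed, not from the patent).
S, A = 17, 6
critic = mlp(S + A, 1)        # action value function
value_net = mlp(S, 1)         # value function
actor = mlp(S, A)             # policy network (deterministic head for brevity)
recons = nn.ModuleList([mlp(S + A + 1 + S, d) for d in (S, A, 1)])  # 3-MLP ensemble

opts = [torch.optim.Adam(m.parameters(), lr=3e-4)
        for m in (critic, value_net, actor, recons)]

def tracer_step(s, a, r, s_next, gamma=0.99):
    """One illustrative training iteration following steps 1)-4) above; the
    TD/MSE surrogates below stand in for the patent's losses (Eqs. (6)-(11))."""
    x = torch.cat([s, a, r, s_next], dim=-1)

    # 1) corrupted data -> action value function and value function
    q = critic(torch.cat([s, a], dim=-1))
    v = value_net(s)

    # 2) corrupted data -> ensemble of 3 MLPs reconstructing state, action, reward
    s_hat, a_hat, r_hat = (m(x) for m in recons)

    # 3) state -> currently learned strategy (actor output)
    pi = actor(s)

    # 4) update each network from its loss; repeated over many iterations
    with torch.no_grad():
        target_q = r + gamma * value_net(s_next)
    critic_loss = ((q - target_q) ** 2).mean()
    value_loss = ((v - q.detach()) ** 2).mean()
    recon_loss = (((s_hat - s) ** 2).mean()
                  + ((a_hat - a) ** 2).mean()
                  + ((r_hat - r) ** 2).mean())
    actor_loss = -critic(torch.cat([s, pi], dim=-1)).mean()

    for opt, loss in zip(opts, (critic_loss, value_loss, actor_loss, recon_loss)):
        opt.zero_grad()
        loss.backward()
        opt.step()

# Example call with a random batch; rewards must have shape (batch, 1).
batch = 256
tracer_step(torch.randn(batch, S), torch.randn(batch, A),
            torch.randn(batch, 1), torch.randn(batch, S))
```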
Verification:
The learned strategy obtained by training on the corrupted data is deployed on a robot running in a clean test environment to evaluate its robustness. Through interaction between the robot and the environment, the total reward acquired after a certain number of decisions is calculated and normalized as the index of the evaluation effect. The higher the average score, the better the robustness of the inventive method.
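By way of illustration only, the following sketch shows how such a normalized score could be computed. The D4RL-style normalization constants and the old-style gym environment interface are assumptions.

```python
import numpy as np

def evaluate(policy, env, episodes=10):
    """Roll out the learned strategy in a clean test environment and
    accumulate rewards over a number of episodes (old gym API assumed)."""
    totals = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = policy(obs)                     # action from the trained actor
            obs, reward, done, _ = env.step(action)
            total += reward
        totals.append(total)
    return totals

def normalized_score(returns, random_score, expert_score):
    """D4RL-style normalization: 0 for a random policy, 100 for an expert.
    The per-task reference scores must be supplied (assumption)."""
    return 100.0 * (np.mean(returns) - random_score) / (expert_score - random_score)
```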
It will be appreciated by those skilled in the art that all or part of the flow of the method of the above embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and the program may include the flow of the embodiment of the above methods when executed. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random-access Memory (Random Access Memory, RAM), or the like.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims. The information disclosed in the background section herein is only for enhancement of understanding of the general background of the invention and is not to be taken as an admission or any form of suggestion that this information forms the prior art already known to those of ordinary skill in the art.
Claims (6)
1. A method for improving the control accuracy of an offline reinforcement learning robot, characterized by comprising the following steps:
Step 1, acquiring an offline data set containing damaged data;
Step 2, mapping the offline data set into an action value function, estimating the uncertainty value of the posterior distribution of the action value function by a variational Bayesian inference method, and adjusting the usage weight of the damaged data based on the uncertainty value to train an offline reinforcement learning model for controlling the robot until the accumulated value of rewards is maximized;
In step 2, mapping the offline data set into an action value function, estimating the uncertainty value of the posterior distribution of the action value function by the variational Bayesian inference method, and adjusting the usage weight of the damaged data based on the uncertainty value to train the offline reinforcement learning model for controlling the robot until the accumulated value of rewards is maximized comprises the following steps:
Step 21, inputting the three kinds of damaged data in the offline data set into a critic network and a value network of the offline reinforcement learning model to generate an action value function and a value function, the three kinds of damaged data being damaged state data, damaged action data and damaged reward data;
Step 22, inputting the generated action value function and value function together with the three kinds of damaged data into an integrated model formed by three multi-layer perceptrons in parallel to obtain three kinds of reconstructed data, the three kinds of reconstructed data being reconstructed state data, reconstructed action data and reconstructed reward data; estimating the posterior distribution of the action value function from the three kinds of reconstructed data, and calculating the uncertainty value of the posterior distribution of the action value function;
Step 23, adjusting the usage weight of the damaged data according to the calculated uncertainty value of the posterior distribution of the action value function, and updating the critic network and the integrated model using the three kinds of reconstructed data according to the determined usage weight of the damaged data;
Step 24, inputting the reconstructed state data into the updated critic network to generate a new action value function;
Step 25, updating the actor network by using the new action value function;
Step 26, repeating steps 21 to 25 to train the critic network, the value network and the actor network of the offline reinforcement learning model until the output value of the value network is maximized, completing the training of the offline reinforcement learning model;
In step 22, the generated action value function and value function together with the three kinds of damaged data are input into the integrated model formed by three multi-layer perceptrons in parallel to obtain the three kinds of reconstructed data in the following manner:
Using a multi-layer perceptron to model the likelihood function of the damaged action data, and using variational Bayesian inference, based on the likelihood function of the damaged action data, to deduce the posterior distribution that maximizes the cumulative reward, thereby obtaining the loss function of the posterior distribution of the maximized cumulative reward corresponding to the damaged action data, given by equation (6); equation (6) takes data subject to the damaged action distribution as observation data to estimate the loss function of the posterior distribution of the cumulative reward, and equation (6) is:
(6);
In the above equation (6), the parameters denote: the loss function identifier, representing the loss function for optimizing the action value function based on the reconstructed action data; the parameters of the critic network that models the action value function; the multi-layer perceptron for reconstructing the damaged action data; the expectation operator; the offline data set; the action value distribution; the KL divergence; the damaged action data distribution obtained by modeling the damaged action data with the multi-layer perceptron; the reconstructed action data subject to the damaged action data distribution; the new action value function subject to the action value distribution and represented by its parameters; the damaged state data subject to the damaged state distribution in the offline data set; the damaged reward data subject to the damaged reward distribution in the offline data set; the damaged next-state data, under the damaged state data, subject to the damaged state transition distribution in the offline data set; and the damaged action distribution to which the damaged action data of the offline data set are subject;
Using a multi-layer perceptron to model the likelihood function of the damaged reward data, and using variational Bayesian inference, based on the likelihood function of the damaged reward data, to deduce the posterior distribution that maximizes the cumulative reward, thereby obtaining the loss function of the posterior distribution of the maximized cumulative reward corresponding to the damaged reward data, given by equation (7); equation (7) takes data subject to the damaged reward distribution as observation data to estimate the loss function of the posterior distribution of the cumulative reward, and equation (7) is:
(7);
In the above equation (7), the parameters denote: the loss function identifier, representing the loss function for optimizing the action value function based on the reconstructed reward data; the parameters of the critic network that models the action value function; the multi-layer perceptron for reconstructing the damaged reward data; the expectation operator; the offline data set; the action value distribution; the KL divergence; the damaged reward data distribution obtained by modeling the damaged reward data with the multi-layer perceptron; the damaged reward distribution in the offline data set; the reconstructed reward data subject to the damaged reward data distribution; the new action value function subject to the action value distribution and represented by its parameters; the damaged state data subject to the damaged state distribution in the offline data set; the damaged action data subject to the damaged action distribution in the offline data set; and the damaged next-state data, under the damaged state data, subject to the damaged state distribution in the offline data set;
Using a multi-layer perceptron to model the likelihood function of the damaged state data, and using variational Bayesian inference, based on the likelihood function of the damaged state data, to deduce the posterior distribution that maximizes the cumulative reward, thereby obtaining the loss function of the posterior distribution of the maximized cumulative reward corresponding to the damaged state data, given by equation (8); equation (8) takes data subject to the damaged state distribution as observation data to estimate the loss function of the posterior distribution of the cumulative reward, and equation (8) is:
(8);
In the above equation (8), the parameters denote: the loss function identifier, representing the loss function for optimizing the action value function based on the reconstructed state data; the parameters of the critic network that models the action value function; the multi-layer perceptron for reconstructing the damaged state data; the expectation operator; the offline data set; the action value distribution; the KL divergence; the damaged state data distribution obtained by modeling the damaged state data with the multi-layer perceptron; the reconstructed state data subject to the damaged state data distribution; the new action value function subject to the action value distribution and represented by its parameters; the damaged action data subject to the damaged action distribution in the offline data set; and the damaged reward data subject to the damaged reward distribution in the offline data set;
In step 22, the first-term KL divergence in equation (6) for obtaining the posterior distribution of the maximized cumulative reward corresponding to the damaged action data, the first-term KL divergence in equation (7) for obtaining the posterior distribution of the maximized cumulative reward corresponding to the damaged reward data, and the first-term KL divergence in equation (8) for obtaining the posterior distribution of the maximized cumulative reward corresponding to the damaged state data are each replaced, by generalized Bayesian inference, with a damaged-data reconstruction loss; this first-term damaged-data reconstruction loss is calculated by equation (9):
(9);
In the above equation (9), the parameters denote: the expectation operator; the offline data set; the damaged state data sample, the damaged action data sample and the damaged reward data sample in the offline data set, respectively; the new action value function obeying the action value distribution; the mean and the standard deviation of the corresponding damaged reward data distribution; the mean and the standard deviation of the corresponding damaged state data distribution; the mean and the standard deviation of the corresponding damaged action data distribution; and the superscript T denotes the transpose of a matrix;
Generalized Bayesian inference is likewise used to maximize, as a maximum likelihood loss, the second-term log probability in equation (6) for obtaining the posterior distribution of the maximized cumulative reward corresponding to the damaged action data, the second-term log probability in equation (7) corresponding to the damaged reward data, and the second-term log probability in equation (8) corresponding to the damaged state data, respectively; this second-term maximum likelihood loss is calculated by equation (10):
(10);
In the above equation (10), the parameters denote: the expectation operator; the offline data set; the damaged state data sample, the damaged action data sample and the damaged reward data sample in the offline data set, respectively; samples subject respectively to the damaged state data distribution, the damaged action data distribution and the damaged reward data distribution; the damaged action data distribution obtained by reconstructing the damaged action data with a multi-layer perceptron; the damaged state data distribution obtained by reconstructing the damaged state data with a multi-layer perceptron; the damaged reward data distribution obtained by reconstructing the damaged reward data with a multi-layer perceptron; a Huber function with a given threshold; a value randomly sampled in the interval [0, 1]; the quantile values corresponding to the quantile function of the new action value function, of which one carries gradient information and is used to update the parameters of the action value function, while the other carries no gradient information, is only a constant, and is not used to update the parameters of the action value function; and the action value distribution;
In step 22, the posterior distribution of the action value function is estimated from the three kinds of reconstructed data in the following manner:
The derived damaged-data reconstruction loss and the maximum likelihood loss are summed into an integrated loss function, so that the posterior distribution of the action value function is calculated using all of the damaged data; the parameters of the posterior distribution of the action value function are updated by the integrated loss function of equation (11), and the posterior distribution of the action value function is then updated based on these parameters, where equation (11) is:
(11);
In the above equation (11), the parameters denote: the parameters of the critic network that models the action value function; the multi-layer perceptron for reconstructing the damaged state data among the three kinds of damaged data; the multi-layer perceptron for reconstructing the damaged action data among the three kinds of damaged data; the multi-layer perceptron for reconstructing the damaged reward data among the three kinds of damaged data; the damaged-data reconstruction loss; and the maximum likelihood loss;
Step 3, deploying the offline reinforcement learning model trained in step 2 on a robot, and controlling the robot to complete a preset operation task.
2. The method for improving the control accuracy of an offline reinforcement learning robot according to claim 1, wherein the offline data set containing damaged data acquired in step 1 comprises a damaged action data subset composed of damaged action data, a damaged state data subset composed of damaged state data, and a damaged reward data subset composed of damaged reward data.
3. The method for improving the control accuracy of an offline reinforcement learning robot according to claim 1, wherein in step 22, after the posterior distribution of the action value function has been updated, the uncertainty value of the posterior distribution of the action value function is calculated by equation (12), where equation (12) is:
(12);
In the above equation (12), the parameters denote: N is the total number of sampling times, fixed at 32 in practice; two quantile values corresponding to the quantile function of the new action value function, where, when the subscript n of the one equals the subscript n of the other, the two are equal; and the damaged state data sample, the damaged action data sample and the damaged reward data sample in the offline data set, respectively.
4. The method for improving the control accuracy of an offline reinforcement learning robot according to claim 1, wherein in step 23, the usage weight of the damaged data is adjusted such that the usage weight of the damaged data is inversely proportional to the calculated uncertainty value of the posterior distribution of the action value function.
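By way of illustration only, the following sketch shows one plausible way to map the uncertainty value of claim 3 to the usage weight of claim 4. The standard-deviation proxy for equation (12) and the normalized 1/u weighting are assumptions, since the equations themselves are not reproduced above.

```python
import torch

def quantile_uncertainty(quantile_values):
    """Assumed stand-in for equation (12): the spread of N sampled quantile
    values of the action value distribution per sample, with N = 32 as stated
    in claim 3. Input shape: (batch, N)."""
    return quantile_values.std(dim=-1)

def usage_weights(uncertainty, eps=1e-6):
    """Claim 4 only requires the usage weight to be inversely proportional to
    the uncertainty value; this normalized 1/u form is one plausible choice."""
    w = 1.0 / (uncertainty + eps)
    return w / w.mean()   # keep the mean weight at 1 so the loss scale is preserved

# Example: down-weight transitions whose value estimate is most uncertain.
u = quantile_uncertainty(torch.randn(256, 32))
w = usage_weights(u)      # larger uncertainty -> smaller weight on that sample
```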
5. A processing apparatus, comprising:
At least one memory for storing one or more programs;
At least one processor capable of executing one or more programs stored in the memory, which when executed by the processor, cause the processor to implement the method of any of claims 1-4.
6. A readable storage medium storing a computer program, characterized in that the method of any one of claims 1-4 is implemented when the computer program is executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411475045.4A CN119002288B (en) | 2024-10-22 | 2024-10-22 | Method, device and medium for improving control accuracy of offline reinforcement learning robot |
Publications (2)
Publication Number | Publication Date |
---|---|
CN119002288A CN119002288A (en) | 2024-11-22 |
CN119002288B true CN119002288B (en) | 2025-02-11 |
Family
ID=93488347
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202411475045.4A Active CN119002288B (en) | 2024-10-22 | 2024-10-22 | Method, device and medium for improving control accuracy of offline reinforcement learning robot |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN119002288B (en) |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106094817B (en) * | 2016-06-14 | 2018-12-11 | 华南理工大学 | Intensified learning humanoid robot gait's planing method based on big data mode |
CA3032159A1 (en) * | 2018-01-31 | 2019-07-31 | Royal Bank Of Canada | Interactive reinforcement learning with dynamic reuse of prior knowledge |
US11493926B2 (en) * | 2019-05-15 | 2022-11-08 | Baidu Usa Llc | Offline agent using reinforcement learning to speedup trajectory planning for autonomous vehicles |
US11556862B2 (en) * | 2019-09-14 | 2023-01-17 | Oracle International Corporation | Techniques for adaptive and context-aware automated service composition for machine learning (ML) |
US11989020B1 (en) * | 2020-07-14 | 2024-05-21 | Aurora Operations, Inc. | Training machine learning model(s), in simulation, for use in controlling autonomous vehicle(s) |
US20230107725A1 (en) * | 2021-09-28 | 2023-04-06 | Hitachi, Ltd. | Boosting deep reinforcement learning performance by combining off-line data and simulators |
KR102715372B1 (en) * | 2022-02-25 | 2024-10-14 | 경희대학교 산학협력단 | An auto-adaptive controller tuning system for wastewater treatment process using data-driven smart decisions of offline reinforcement learning |
CN115657477A (en) * | 2022-10-13 | 2023-01-31 | 北京理工大学 | An Adaptive Control Method for Robots in Dynamic Environment Based on Offline Reinforcement Learning |
CN116224794A (en) * | 2023-03-03 | 2023-06-06 | 北京理工大学 | Reinforced learning continuous action control method based on discrete-continuous heterogeneous Q network |
- 2024-10-22: CN CN202411475045.4A patent/CN119002288B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN119002288A (en) | 2024-11-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Mazoure et al. | Leveraging exploration in off-policy algorithms via normalizing flows | |
Dam et al. | Neural-based learning classifier systems | |
Chen et al. | Probabilistic classification vector machines | |
Yassin et al. | Binary particle swarm optimization structure selection of nonlinear autoregressive moving average with exogenous inputs (NARMAX) model of a flexible robot arm | |
Lanka et al. | Archer: Aggressive rewards to counter bias in hindsight experience replay | |
Wei et al. | Learning motion rules from real data: Neural network for crowd simulation | |
Parmar et al. | Fundamental challenges in deep learning for stiff contact dynamics | |
Taylor et al. | Metric learning for reinforcement learning agents | |
Curran et al. | Using PCA to efficiently represent state spaces | |
CN114358278B (en) | Training method and device for neural network model | |
Zhang et al. | Evolving neural network classifiers and feature subset using artificial fish swarm | |
Luber et al. | Structural neural additive models: Enhanced interpretable machine learning | |
WO2021023724A1 (en) | System simulating a decisional process in a mammal brain about motions of a visually observed body | |
Wang et al. | Deep recurrent belief propagation network for POMDPs | |
CN119002288B (en) | Method, device and medium for improving control accuracy of offline reinforcement learning robot | |
Ma et al. | Improving offline reinforcement learning with in-sample advantage regularization for robot manipulation | |
CN115292154A (en) | Safety scene acceleration test method and system based on countermeasure reinforcement learning | |
Oliaee et al. | Faults detecting of high-dimension gas turbine by stacking DNN and LLM | |
CN114618167A (en) | Anti-cheating detection model construction method and anti-cheating detection method | |
Matthews et al. | Crowd grounding: finding semantic and behavioral alignment through human robot interaction. | |
Izquierdo-Torres et al. | Learning on a continuum in evolved dynamical node networks | |
CN118906066B (en) | Robot decision method based on link prediction relevance | |
Saxena et al. | An evolutionary feature selection technique using polynomial neural network | |
Arndt et al. | Domain curiosity: Learning efficient data collection strategies for domain adaptation | |
Torii et al. | How can we help others?: a computational account for action completion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||