
CN114819093A - Method and Apparatus for Policy Optimization Using Memristor Array-Based Environment Models - Google Patents

Method and Apparatus for Policy Optimization Using Memristor Array-Based Environment Models Download PDF

Info

Publication number
CN114819093A
Authority
CN
China
Prior art keywords
time
strategy
cost
optimization
environment model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210497721.2A
Other languages
Chinese (zh)
Other versions
CN114819093B (en)
Inventor
高滨
林钰登
唐建石
吴华强
张清天
钱鹤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202210497721.2A priority Critical patent/CN114819093B/en
Publication of CN114819093A publication Critical patent/CN114819093A/en
Priority to PCT/CN2023/092475 priority patent/WO2023217027A1/en
Application granted granted Critical
Publication of CN114819093B publication Critical patent/CN114819093B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Neurology (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Feedback Control In General (AREA)

Abstract

A policy optimization method and a policy optimization apparatus using a memristor-array-based dynamic environment model. The method includes: acquiring a dynamic environment model based on a memristor array; performing multiple predictions at multiple times according to the dynamic environment model and a target policy to obtain a data sample set including optimization costs of the target policy at the multiple times; and, based on the data sample set, performing a policy search using a policy gradient optimization algorithm to optimize the target policy. The method uses the memristor-array-based dynamic environment model to generate the data sample set, realizing long-horizon planning based on the dynamic environment model, and then performs the policy search with a more stable algorithm such as a policy gradient optimization algorithm, so that the target policy can be optimized effectively.


Description

Method and Apparatus for Policy Optimization Using Memristor Array-Based Environment Models

TECHNICAL FIELD

Embodiments of the present disclosure relate to a policy optimization method and a policy optimization apparatus using a memristor-array-based dynamic environment model.

BACKGROUND

Artificial neural networks (ANNs) are widely used in the modeling of dynamic systems. However, long-term task planning with conventional artificial neural networks remains a challenge because they lack the ability to model uncertainty. The randomness (uncertainty) inherent in real systems, i.e., process noise, together with the approximation errors introduced by data-driven modeling, can cause the long-term estimates of an artificial neural network to deviate from the actual behavior of the system. Probabilistic models provide a way to address this uncertainty: they allow a model's predictions to be used for making informed decisions while remaining cautious about the uncertainty of those predictions.

SUMMARY OF THE INVENTION

At least one embodiment of the present disclosure provides a policy optimization method using a memristor-array-based dynamic environment model, including: acquiring a dynamic environment model based on a memristor array; performing multiple predictions at multiple times according to the dynamic environment model and a target policy to obtain a data sample set including optimization costs of the target policy at the multiple times; and, based on the data sample set, performing a policy search using a policy gradient optimization algorithm to optimize the target policy.

For example, in the policy optimization method provided by an embodiment of the present disclosure, acquiring the dynamic environment model includes: acquiring a Bayesian neural network having a trained weight matrix; obtaining a plurality of corresponding target conductance values according to the weight matrix of the Bayesian neural network, and mapping the plurality of target conductance values onto the memristor array; and inputting the state of a dynamic system at time t and a latent input variable, as input signals, into the weight-mapped memristor array, processing the state at time t and the latent input variable through the memristor array according to the Bayesian neural network, and obtaining from the memristor array an output signal corresponding to the processing result, the output signal being used to obtain a prediction of the dynamic system at time t+1.

For example, in the policy optimization method provided by an embodiment of the present disclosure, the dynamic environment model is expressed as s_{t+1} = f(s_t, a_t; W, ε), where s_t is the state of the dynamic system at time t, a_t is the action of the target policy at time t, W is the weight matrix of the Bayesian neural network, ε is the additive noise corresponding to the memristor array, and s_{t+1} is the prediction of the dynamic system at time t+1. The action of the target policy at time t is a_t = π(s_t; W_π), where π denotes the function of the target policy and W_π denotes the policy parameters; the weight matrix W of the Bayesian neural network follows the distribution W ~ q(W), and the additive noise ε is additive Gaussian noise, ε ~ N(0, σ²).

For example, in the policy optimization method provided by an embodiment of the present disclosure, the multiple times include time 1 to time T arranged in order from early to late, and performing multiple predictions at multiple times according to the dynamic environment model and the target policy to obtain the data sample set including the optimization costs of the target policy at the multiple times includes: for any time t-1 from time 1 to time T, obtaining the action to be executed from the target policy, that is, obtaining the action a_{t-1} of the target policy at time t-1 from a_{t-1} = π(s_{t-1}; W_π); computing the state s_t at the next time t after time t-1 according to the dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε) and obtaining the cost c_t corresponding to the state s_t at time t, thereby obtaining the cost sequence {c_1, c_2, ..., c_t} from time 1 to time t, and obtaining the optimization cost J_{t-1} at time t based on the cost sequence, where 1 ≤ t ≤ T; and obtaining the data sample set {[a_0, J_0], ..., [a_{T-1}, J_{T-1}]} from time 1 to time T.

For example, in the policy optimization method provided by an embodiment of the present disclosure, if the expected value of the cost c_t at time t is E[c_t], the optimization cost at time t can be obtained by

J_{t-1} = Σ_{τ=1}^{t} E[c_τ].

For example, in the policy optimization method provided by an embodiment of the present disclosure, the cost further includes a cost variation caused by aleatoric uncertainty and a cost variation caused by epistemic uncertainty; the aleatoric uncertainty arises from the latent input variables, and the epistemic uncertainty arises from the intrinsic noise of the memristor array.

For example, in the policy optimization method provided by an embodiment of the present disclosure, when these cost variations are taken into account, the optimization cost at time t can be obtained by

J_{t-1} = Σ_{τ=1}^{t} (E[c_τ] + σ(η, θ)),

where σ(η, θ) is a function of the aleatoric uncertainty and the epistemic uncertainty, η denotes the aleatoric uncertainty, and θ denotes the epistemic uncertainty.

For example, in the policy optimization method provided by an embodiment of the present disclosure, computing the state s_t at time t according to the dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε) and obtaining the cost c_t corresponding to the state s_t at time t includes: sampling the latent input variable z from the distribution p(z) to obtain a sample; inputting the sample and the state s_{t-1} at time t-1 into the weight-mapped memristor array to obtain the predicted state s_t; and, for the predicted state s_t, obtaining the cost c_t = c(s_t).

For example, in the policy optimization method provided by an embodiment of the present disclosure, the policy gradient optimization algorithm includes the REINFORCE algorithm, the PPO (Proximal Policy Optimization) algorithm, or the TRPO (Trust Region Policy Optimization) algorithm.

At least one embodiment of the present disclosure further provides a policy optimization apparatus using a memristor-array-based dynamic environment model, including: an acquisition unit configured to acquire a dynamic environment model based on a memristor array; a computation unit configured to perform multiple predictions at multiple times according to the dynamic environment model and a target policy to obtain a data sample set including optimization costs of the target policy at the multiple times; and a policy search unit configured to, based on the data sample set, perform a policy search using a policy gradient optimization algorithm to optimize the target policy.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to explain the technical solutions of the embodiments of the present disclosure more clearly, the accompanying drawings of the embodiments are briefly introduced below. Obviously, the drawings described below relate only to some embodiments of the present disclosure and do not limit the present disclosure.

FIG. 1A shows a schematic flowchart of a policy optimization method based on a memristor-array dynamic environment model provided by at least one embodiment of the present disclosure;

FIG. 1B shows a schematic flowchart of step S101 in FIG. 1A;

FIG. 2A shows a schematic structure of a memristor array;

FIG. 2B is a schematic diagram of a memristor device;

FIG. 2C is a schematic diagram of another memristor device;

FIG. 2D shows a schematic diagram of mapping the weight matrix of a Bayesian neural network onto a memristor array;

FIG. 3 shows a schematic flowchart of step S102 in FIG. 1A;

FIG. 4 shows a schematic diagram of an example of the policy optimization method provided by at least one embodiment of the present disclosure;

FIG. 5 shows a schematic block diagram of a policy optimization apparatus using a memristor-array-based dynamic environment model provided by at least one embodiment of the present disclosure.

DETAILED DESCRIPTION

In order to make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the described embodiments, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.

Unless otherwise defined, technical or scientific terms used in the present disclosure shall have the ordinary meaning understood by a person of ordinary skill in the art to which the present disclosure belongs. The terms "first", "second", and the like used in the present disclosure do not denote any order, quantity, or importance, but are merely used to distinguish different components. Similarly, words such as "a", "an", or "the" do not denote a limitation of quantity, but rather the presence of at least one. Words such as "comprise" or "include" mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connected" or "coupled" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Up", "down", "left", "right", and the like are only used to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may change accordingly.

In model-free deep reinforcement learning, an agent usually needs a large amount of trial-and-error interaction with the real environment. The data efficiency is low, so such methods cannot be applied to real tasks in which trial and error is costly. Model-based deep reinforcement learning can use data more efficiently. In model-based deep reinforcement learning, the agent first learns a dynamic environment model from the historical experience of interacting with the real dynamic environment (for example, state-transition data collected in advance), and then interacts with the dynamic environment model to obtain a sub-optimal policy.

When a model-based reinforcement learning method has learned an accurate dynamic environment model, the agent can be trained with this model without interacting with the real environment too many times; the agent can "imagine" what interacting with the real environment would be like. This greatly improves data efficiency and suits real physical scenarios in which acquiring data is expensive. At the same time, the dynamic environment model can predict unknown states of the environment and generalize the agent's knowledge; it can also serve as a new data source that provides contextual information to aid decision-making, thereby alleviating the exploration-exploitation dilemma. When modeling a real environment, the randomness (uncertainty) inherent in the environment, i.e., process noise, together with the approximation errors introduced by data-driven modeling, can cause the long-term estimates of an artificial neural network to deviate from the actual behavior of the system. Probabilistic models provide a way to address this uncertainty: they make it possible to use a model's predictions to make informed decisions while remaining cautious about the uncertainty of those predictions.

The inventors have found that a Bayesian neural network (BNN) is a probabilistic model that places a neural network in a Bayesian framework and can describe complex stochastic patterns. Furthermore, a Bayesian neural network with latent input variables (BNN+LV) can describe complex stochastic patterns through a distribution over the latent input variables (aleatoric uncertainty) while accounting for model uncertainty through a distribution over the weights (epistemic uncertainty). A latent input variable is a variable that cannot be observed directly but influences the state and output of the probabilistic model. The inventors described a method and apparatus for implementing a Bayesian neural network using the intrinsic noise of memristors in Chinese patent application publication CN110956256A, which is incorporated herein by reference in its entirety as part of the present application.

The structure of the Bayesian neural network includes, but is not limited to, a fully connected structure, a convolutional neural network (CNN) structure, and the like; its network weights W are random variables following a certain distribution (W ~ q(W)).

The inventors have further found that, given a data set D = {X, Y} of a dynamic system for the Bayesian neural network, where X is the state feature vector of the dynamic system and Y is the next state of the dynamic system, the inputs of the Bayesian neural network are the state feature vector X of the dynamic system and the latent input variable z (z ~ p(z)), and the parameters of the Bayesian neural network can be trained. Moreover, the output of the Bayesian neural network, with independent additive Gaussian noise ε (ε ~ N(0, σ²)) superimposed on it, is the prediction y of the next state of the dynamic system, that is, y = f(X, z, W, ε). Accordingly, after training, every weight of the Bayesian neural network is a distribution; for example, the weights are distributions independent of one another.
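For illustration only, the following Python sketch shows one stochastic forward pass of such a BNN+LV in software; the Gaussian weight posterior, the tanh activations, and all names (e.g. bnn_lv_predict, q_mu, q_sigma) are assumptions of the sketch, not details of the disclosure.

```python
import numpy as np

def bnn_lv_predict(X, q_mu, q_sigma, sigma_eps, rng):
    """One stochastic prediction y = f(X, z, W, eps) of the next state.

    q_mu, q_sigma: per-layer means/stds of an assumed Gaussian posterior q(W);
    z ~ p(z) is the latent input variable; eps ~ N(0, sigma_eps^2)."""
    z = rng.normal(size=1)                      # sample the latent input variable
    h = np.concatenate([X, z])
    *hidden, last = list(zip(q_mu, q_sigma))
    for mu, sd in hidden:
        W = rng.normal(mu, sd)                  # sample a weight matrix W ~ q(W)
        h = np.tanh(W @ h)
    W_out = rng.normal(last[0], last[1])        # sampled output-layer weights
    eps = rng.normal(0.0, sigma_eps, size=W_out.shape[0])
    return W_out @ h + eps                      # y = f(X, z, W, eps)
```

Averaging many such calls gives a predictive mean, while their spread reflects the combined aleatoric and epistemic uncertainty.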

In long-horizon planning tasks, the gradient must be back-propagated through many steps, which gives rise to vanishing and exploding gradients. Moreover, when a neural network implemented directly on a memristor array performs policy search, the intrinsic stochasticity of the memristors introduces additional noise as the gradient is back-propagated through the memristor array, and such noisy gradients cannot effectively optimize the policy search.

At least one embodiment of the present disclosure provides a policy optimization method using a memristor-array-based dynamic environment model, including: acquiring a dynamic environment model based on a memristor array; performing multiple predictions at multiple times according to the dynamic environment model and a target policy to obtain a data sample set including optimization costs of the target policy at the multiple times; and, based on the data sample set, performing a policy search using a policy gradient optimization algorithm to optimize the target policy.

The policy optimization method provided by the above embodiments of the present disclosure uses the memristor-array-based dynamic environment model to generate the data sample set, realizing long-horizon planning based on the dynamic environment model, and then performs the policy search with a more stable algorithm such as a policy gradient optimization algorithm. It is free of vanishing and exploding gradients and can optimize the target policy effectively.

At least one embodiment of the present disclosure further provides a policy optimization apparatus corresponding to the above policy optimization method.

Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings, but the present disclosure is not limited to these specific embodiments.

FIG. 1A shows a schematic flowchart of a policy optimization method based on a memristor-array dynamic environment model provided by at least one embodiment of the present disclosure.

As shown in FIG. 1A, the policy optimization method includes the following steps S101 to S103.

Step S101: acquiring a dynamic environment model based on a memristor array.

In the embodiments of the present disclosure, for example, a memristor-array-based BNN+LV can be used to model the dynamic system to obtain the dynamic environment model. The specific steps are shown in FIG. 1B and are not repeated here.

Step S102: performing multiple predictions at multiple times according to the dynamic environment model and the target policy to obtain a data sample set including optimization costs of the target policy at the multiple times.

For example, the target policy involved is used for deep reinforcement learning; for example, it can be the policy by which an agent maximizes its return or achieves a specific goal while interacting with the environment.

Step S103: based on the data sample set, performing a policy search using a policy gradient optimization algorithm to optimize the target policy.

For example, in different examples of the embodiments of the present disclosure, the policy gradient optimization algorithm may include the REINFORCE algorithm, the PPO (Proximal Policy Optimization) algorithm, or the TRPO (Trust Region Policy Optimization) algorithm. In the embodiments of the present disclosure, these policy gradient optimization methods are more stable and can effectively optimize the target policy.
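For illustration only, a minimal REINFORCE-style update on the collected samples {[a_{t-1}, J_{t-1}]} is sketched below; the Gaussian policy with a tanh-linear mean and all names (e.g. reinforce_update, sigma_a) are assumptions of the sketch rather than details of the disclosure.

```python
import numpy as np

def reinforce_update(W_pi, samples, states, lr=1e-2, sigma_a=0.1):
    """One REINFORCE step that lowers the expected optimization cost J.

    samples: list of (action, J) pairs; states: the states the actions were taken in.
    Assumes a Gaussian policy a ~ N(tanh(W_pi @ s), sigma_a^2 I)."""
    baseline = np.mean([J for _, J in samples])           # baseline for variance reduction
    grad = np.zeros_like(W_pi)
    for (a, J), s in zip(samples, states):
        mu = np.tanh(W_pi @ s)                            # policy mean pi(s; W_pi)
        dlogp_dmu = (a - mu) / sigma_a**2                 # d log N(a | mu, sigma^2) / d mu
        grad += np.outer(dlogp_dmu * (1.0 - mu**2), s) * (J - baseline)
    return W_pi - lr * grad / len(samples)                # descend, since J is a cost
```

Because the gradient is estimated from sampled rollouts rather than back-propagated through the memristor array, the update avoids the noisy gradients discussed above.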

FIG. 1B shows a schematic flowchart of an example of step S101 in FIG. 1A.

As shown in FIG. 1B, the example of step S101 may include the following steps S111 to S113.

Step S111: acquiring a Bayesian neural network, where the Bayesian neural network has a trained weight matrix.

For example, the structure of the Bayesian neural network includes a fully connected structure, a convolutional neural network structure, or the like. Each network weight of the Bayesian neural network is a random variable. For example, after the Bayesian neural network has been trained, each weight is a distribution, such as a Gaussian distribution or a Laplace distribution.

For example, the weight matrix can be obtained by training the Bayesian neural network offline. The Bayesian neural network can be trained by conventional methods, for example using a central processing unit (CPU), a graphics processing unit (GPU), a neural network processing unit (NPU), or the like, which is not repeated here.

Step S112: obtaining a plurality of corresponding target conductance values according to the weight matrix of the Bayesian neural network, and mapping the plurality of target conductance values onto the memristor array.

After the training of the Bayesian neural network is completed and the weight matrix is obtained, the weight matrix is processed to obtain the corresponding plurality of target conductance values. For example, in this process, the weight matrix can be biased and scaled until it fits the appropriate conductance window of the memristor array in use. After the weight matrix has been biased and scaled, the target conductance values are calculated from the processed weight matrix and the conductance values of the memristors. For the specific process of calculating the target conductance values, reference may be made to the related description of memristor-based Bayesian neural networks, which is not repeated here.
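For illustration only, one possible bias-and-scale mapping of trained weights into a conductance window could look like the sketch below; the linear form, the window limits, and the function name are assumptions of the example, not the specific mapping of the disclosure.

```python
import numpy as np

def weights_to_conductance(W, g_min=1e-6, g_max=1e-4):
    """Bias and scale a weight matrix into the conductance window [g_min, g_max] (siemens)."""
    w_min, w_max = float(W.min()), float(W.max())
    scale = (g_max - g_min) / (w_max - w_min)
    return g_min + (W - w_min) * scale              # target conductance values to program
```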

FIG. 2A shows a schematic structure of a memristor array. The memristor array is composed of, for example, a plurality of memristor cells arranged in an array of M rows and N columns, where M and N are both positive integers. Each memristor cell includes a switching element and one or more memristors. In FIG. 2A, WL<1>, WL<2>, ..., WL<M> denote the word lines of the first row, the second row, ..., the M-th row, respectively; the control electrode of the switching element in the memristor cell circuit of each row (for example, the gate of a transistor) is connected to the word line corresponding to that row. BL<1>, BL<2>, ..., BL<N> denote the bit lines of the first column, the second column, ..., the N-th column, respectively; the memristor in the memristor cell circuit of each column is connected to the bit line corresponding to that column. SL<1>, SL<2>, ..., SL<M> denote the source lines of the first row, the second row, ..., the M-th row, respectively; the source of the transistor in the memristor cell circuit of each row is connected to the source line corresponding to that row. According to Kirchhoff's law, by setting the states (for example, the resistances) of the memristor cells and applying corresponding word line signals and bit line signals to the word lines and bit lines, the memristor array can perform multiply-accumulate computations in parallel.
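The parallel multiply-accumulate behavior described above can be modeled in software, purely for illustration, as the vector-matrix product below; the conductance and voltage values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(3, 4))   # M x N conductance matrix (siemens), rows = source lines
v = np.array([0.1, 0.2, 0.3, 0.15])        # read voltages applied on the N bit lines (volts)

# Kirchhoff's law: each source-line current is the sum of V * G over its row,
# i.e. the crossbar performs one analog multiply-accumulate per row in parallel.
i_out = G @ v                               # source-line currents (amperes)
print(i_out)
```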

FIG. 2B is a schematic diagram of a memristor device, which includes a memristor array and its peripheral driving circuits. For example, as shown in FIG. 2B, the memristor device includes a signal acquisition device, a word line driving circuit, a bit line driving circuit, a source line driving circuit, a memristor array, and a data output circuit.

For example, the signal acquisition device is configured to convert a digital signal into a plurality of analog signals through digital-to-analog converters (DACs) to be input to a plurality of column signal input terminals of the memristor array.

For example, the memristor array includes M source lines, M word lines, and N bit lines, as well as a plurality of memristor cells arranged in an array of M rows and N columns.

For example, the operation of the memristor array is carried out through the word line driving circuit, the bit line driving circuit, and the source line driving circuit.

For example, the word line driving circuit includes a plurality of multiplexers (Mux) for switching the word line input voltages; the bit line driving circuit includes a plurality of multiplexers for switching the bit line input voltages; and the source line driving circuit likewise includes a plurality of multiplexers (Mux) for switching the source line input voltages. For example, the source line driving circuit further includes a plurality of ADCs for converting analog signals into digital signals. In addition, a trans-impedance amplifier (TIA, not shown in the figure) may further be provided between the Mux and the ADC in the source line driving circuit to convert the current into a voltage for processing by the ADC.

For example, the memristor array has an operation mode and a computation mode. When the memristor array is in the operation mode, the memristor cells are in an initialization state, and the values of the parameter elements of the parameter matrix can be written into the memristor array. For example, the source line input voltage, the bit line input voltage, and the word line input voltage of the memristors are switched to the corresponding preset voltage ranges through the multiplexers.

For example, the word line input voltage is switched to the corresponding voltage range through the control signals WL_sw[1:M] of the multiplexers in the word line driving circuit in FIG. 2B. For example, the word line input voltage is set to 2 V (volts) for a set operation on a memristor and to 5 V for a reset operation on a memristor; for example, the word line input voltages can be obtained from the voltage signals V_WL[1:M] in FIG. 2B.

For example, the source line input voltage is switched to the corresponding voltage range through the control signals SL_sw[1:M] of the multiplexers in the source line driving circuit in FIG. 2B. For example, the source line input voltage is set to 0 V for a set operation on a memristor and to 2 V for a reset operation on a memristor; for example, the source line input voltages can be obtained from the voltage signals V_SL[1:M] in FIG. 2B.

For example, the bit line input voltage is switched to the corresponding voltage range through the control signals BL_sw[1:N] of the multiplexers in the bit line driving circuit in FIG. 2B. For example, the bit line input voltage is set to 2 V for a set operation on a memristor and to 0 V for a reset operation on a memristor; for example, the bit line input voltages can be obtained from the DACs in FIG. 2B.

For example, when the memristor array is in the computation mode, the memristors in the memristor array are in a conductive state usable for computation, and the bit line input voltages applied at the column signal input terminals do not change the conductance values of the memristors; for example, the computation can be completed by performing multiply-accumulate operations with the memristor array. For example, the word line input voltage is switched to the corresponding voltage range through the control signals WL_sw[1:M] of the multiplexers in the word line driving circuit in FIG. 2B: when a turn-on signal is applied, the word line input voltage of the corresponding row is set to 5 V, and when no turn-on signal is applied, the word line input voltage of the corresponding row is set to 0 V, for example connected to the GND signal. The source line input voltage is switched to the corresponding voltage range through the control signals SL_sw[1:M] of the multiplexers in the source line driving circuit in FIG. 2B, for example set to 0 V, so that the current signals at the plurality of row signal output terminals can flow into the data output circuit. The bit line input voltage is switched to the corresponding voltage range through the control signals BL_sw[1:N] of the multiplexers in the bit line driving circuit in FIG. 2B, for example set to 0.1 V-0.3 V, so that the memristor array performs the multiply-accumulate operations.

For example, the data output circuit may include a plurality of trans-impedance amplifiers (TIAs) and ADCs, which convert the current signals at the plurality of row signal output terminals into voltage signals and then into digital signals for subsequent processing.

FIG. 2C is a schematic diagram of another memristor device. The memristor device shown in FIG. 2C has substantially the same structure as the memristor device shown in FIG. 2B and also includes a memristor array and its peripheral driving circuits. For example, as shown in FIG. 2C, the memristor device includes a signal acquisition device, a word line driving circuit, a bit line driving circuit, a source line driving circuit, a memristor array, and a data output circuit.

For example, the memristor array includes M source lines, 2M word lines, and 2N bit lines, as well as a plurality of memristor cells arranged in an array of M rows and N columns. For example, each memristor cell has a 2T2R structure; the operation of mapping the parameter matrix used for the transformation processing onto different memristor cells of the memristor array is not repeated here. It should be noted that the memristor array may also include M source lines, M word lines, and 2N bit lines, as well as a plurality of memristor cells arranged in an array of M rows and N columns.

For the description of the signal acquisition device, the control driving circuits, and the data output circuit, reference may be made to the preceding description, which is not repeated here.

FIG. 2D shows the process of mapping the weight matrix of a Bayesian neural network onto a memristor array. A memristor array is used to implement the weight matrices between layers of the Bayesian neural network; for each weight, N memristors are used to implement the distribution corresponding to that weight, where N is an integer greater than or equal to 2. For the random probability distribution corresponding to the weight, N conductance values are calculated, and these N conductance values are mapped, according to the distribution, onto the N memristors. In this way, the weight matrix of the Bayesian neural network is converted into target conductance values mapped into the crossbar of the memristor array.

As shown in FIG. 2D, on the left of the figure is a three-layer Bayesian neural network, which includes three neuron layers connected one after another. For example, the input layer includes the first neuron layer, the hidden layer includes the second neuron layer, and the output layer includes the third neuron layer. For example, the input layer passes the received input data to the hidden layer; the hidden layer computes and transforms the input data and sends the result to the output layer; and the output layer outputs the output result of the Bayesian neural network.

As shown in FIG. 2D, the input layer, the hidden layer, and the output layer each include a plurality of neuron nodes, and the number of neuron nodes in each layer can be set according to the application. For example, the number of neurons in the input layer is 2 (N1 and N2), the number of neurons in the middle hidden layer is 3 (N3, N4, and N5), and the number of neurons in the output layer is 1 (N6).

As shown in FIG. 2D, adjacent neuron layers of the Bayesian neural network are connected by a weight matrix. For example, the weight matrix is implemented by the memristor array on the right of FIG. 2D. For example, the weight parameters can be programmed directly as the conductances of the memristor array; for example, the weight parameters can also be mapped onto the conductances of the memristor array according to a certain rule; for example, the difference between the conductances of two memristors can also be used to represent one weight parameter. Although the present disclosure describes its technical solutions in terms of programming the weight parameters directly as the conductances of the memristor array or mapping the weight parameters onto the conductances of the memristor array according to a certain rule, this is merely exemplary and does not limit the present disclosure.

The structure of the memristor array on the right of FIG. 2D is, for example, as shown in FIG. 2A; the memristor array may include a plurality of memristors arranged in an array. In the example shown in FIG. 2D, the weight connecting input N1 and output N3 is implemented by three memristors (G11, G12, G13), and the other weights in the weight matrix can be implemented in the same way. More specifically, source line SL1 corresponds to neuron N3, source line SL2 corresponds to neuron N4, source line SL5 corresponds to neuron N5, and bit lines BL1, BL2, and BL3 correspond to neuron N1. One weight between the input layer and the hidden layer (the weight between neuron N1 and neuron N3) is converted, according to its distribution, into three target conductance values, which are mapped into the crossbar of the memristor array; here the target conductance values are G11, G12, and G13, framed by a dashed box in the memristor array.

Returning to FIG. 1B, step S113: inputting the state of the dynamic system at time t and a latent input variable, as input signals, into the weight-mapped memristor array; processing the state at time t and the latent input variable through the memristor array according to the Bayesian neural network; and obtaining from the memristor array an output signal corresponding to the processing result, the output signal being used to obtain the prediction of the dynamic system at time t+1.

For example, in some embodiments of the present disclosure, the dynamic environment model is expressed as s_{t+1} = f(s_t, a_t; W, ε), where s_t is the state of the dynamic system at time t, a_t is the action of the target policy at time t, W is the weight matrix of the Bayesian neural network, ε is the additive noise corresponding to the memristor array, and s_{t+1} is the prediction of the dynamic system at time t+1. The action of the target policy at time t is a_t = π(s_t; W_π), where π denotes the function of the target policy and W_π denotes the policy parameters; the weight matrix W of the Bayesian neural network follows the distribution W ~ q(W), and the additive noise ε is additive Gaussian noise, ε ~ N(0, σ²).
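For illustration only, one prediction step of such a model together with the policy could be simulated in software as sketched below; sample_W stands in for drawing W ~ q(W) (in hardware this corresponds to a stochastic read of the array), and all names are assumptions of the sketch.

```python
import numpy as np

def policy(s, W_pi):
    # a_t = pi(s_t; W_pi): a simple tanh-linear policy as a stand-in
    return np.tanh(W_pi @ s)

def env_model_step(s, a, sample_W, sigma_eps, rng):
    # s_{t+1} = f(s_t, a_t; W, eps): one stochastic forward pass of the model
    W = sample_W(rng)                                   # W ~ q(W)
    x = np.concatenate([s, a])
    eps = rng.normal(0.0, sigma_eps, size=s.shape)      # additive Gaussian noise
    return np.tanh(W @ x) + eps                         # predicted next state s_{t+1}
```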

For the memristor array, the input signals are voltage signals and the output signals are current signals; the output signals are read and analog-to-digital converted for subsequent processing. For example, the input sequence is applied to the BLs (bit lines) in the form of voltage pulses, and the output currents flowing out of the SLs (source lines) are then collected for further computation. For example, for the memristor device shown in FIG. 2B or 2C, the input sequence can be converted by the DACs into analog voltage signals, which are applied to the BLs through the multiplexers. Correspondingly, the output currents are obtained from the SLs; each current can be converted into a voltage signal by a trans-impedance amplifier and into a digital signal by an ADC, and the digital signal can be used for subsequent processing. When N memristors are read and N is relatively large, the total output current exhibits a certain distribution, for example one similar to a Gaussian or Laplace distribution. The total output current over all voltage pulses is the result of multiplying the input vector by the weight matrix. In the memristor crossbar array, such a single parallel read operation is therefore equivalent to performing the two operations of sampling and vector-matrix multiplication.
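For illustration only, the combined sampling and vector-matrix multiplication of a single parallel read can be emulated as below, with an assumed 5% relative read noise on the programmed conductances.

```python
import numpy as np

rng = np.random.default_rng(1)
g_target = rng.uniform(1e-6, 1e-4, size=(3, 4))   # programmed target conductances
v = np.array([0.1, 0.2, 0.3, 0.15])               # read-voltage pulses on the bit lines

# Each read perturbs the conductances with intrinsic noise, so repeated parallel reads
# give a distribution over the output currents: sampling and VMM in a single operation.
reads = np.stack([(g_target * (1 + 0.05 * rng.standard_normal(g_target.shape))) @ v
                  for _ in range(1000)])
print(reads.mean(axis=0), reads.std(axis=0))       # mean and spread of source-line currents
```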

The following describes, with reference to FIG. 3, how multiple predictions at multiple times are performed according to the dynamic environment model and the target policy to obtain the data sample set including the optimization costs of the target policy at the multiple times.

For example, the multiple times include time 1 to time T arranged in order from early to late.

FIG. 3 shows a schematic flowchart of an example of step S102 in FIG. 1A.

As shown in FIG. 3, step S102 may include the following steps S301 to S303.

Step S301: for any time t-1 from time 1 to time T, obtaining the action to be executed from the target policy, that is, obtaining the action a_{t-1} of the target policy at time t-1 from a_{t-1} = π(s_{t-1}; W_π).

For example, the action a_{t-1} is the optimal action selected by the target policy in the state at time t-1.

Step S302: computing the state s_t at the next time t after time t-1 according to the dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε) and obtaining the cost c_t corresponding to the state s_t at time t, thereby obtaining the cost sequence {c_1, c_2, ..., c_t} from time 1 to time t, and obtaining the optimization cost J_{t-1} at time t based on the cost sequence, where 1 ≤ t ≤ T.

For example, in some embodiments of the present disclosure, an example of step S302 may include: sampling the latent input variable z from the distribution p(z) to obtain a sample; inputting the sample and the state s_{t-1} at time t-1 into the weight-mapped memristor array to obtain the predicted state s_t; and, for the predicted state s_t, obtaining the cost c_t = c(s_t).

For example, the latent input variable z is first sampled from the distribution p(z); then the state s_{t-1} at time t-1 and the sample of the latent input variable are applied to the BLs as read (READ) voltage pulses of the memristor array, and the output currents flowing out of the SLs are collected for further computation to obtain the cost c_t corresponding to time t. By performing the above operations on the state at every time from time 1 to time t, the cost sequence {c_1, c_2, ..., c_t} can be obtained.

Step S303: obtaining the data sample set {[a_0, J_0], ..., [a_{T-1}, J_{T-1}]} from time 1 to time T.

For example, if the expected value of the cost c_t at time t is E[c_t], the optimization cost at time t can be obtained by

J_{t-1} = Σ_{τ=1}^{t} E[c_τ].

For example, in some embodiments of the present disclosure, the cost further includes a cost variation caused by aleatoric uncertainty and a cost variation caused by epistemic uncertainty; the aleatoric uncertainty arises from the latent input variables, and the epistemic uncertainty arises from the intrinsic noise of the memristor array.

For example, if the cost variations caused by aleatoric and epistemic uncertainty are further taken into account, the optimization cost at time t can be obtained by

J_{t-1} = Σ_{τ=1}^{t} (E[c_τ] + σ(η, θ)),

where σ(η, θ) is a function of the aleatoric uncertainty and the epistemic uncertainty, η denotes the aleatoric uncertainty, and θ denotes the epistemic uncertainty.

For any time between time 1 and time T, the data sample corresponding to that time can be obtained; the data sample set from time 1 to time T is {[a_0, J_0], ..., [a_{T-1}, J_{T-1}]}.

For example, an exemplary flow of the above policy optimization method based on the memristor-array dynamic environment model is as follows:

Input: the memristor-array-based dynamic environment model and an initial target policy

For n = 1 to N, loop:

    Initialize the state s_0

    For t = 1 to T, loop:

        Obtain the action a_{t-1} to be executed from the target policy π

        Predict the state s_t at time t using the memristor-array-based dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε), obtaining the data sample [s_t]

        Compute the cost and the optimization cost of the target policy at this time:

        c_t = c(s_t)

        J_{t-1} = Σ_{τ=1}^{t} E[c_τ]

        Record [a_{t-1}, J_{t-1}]

    Obtain the data sample set {[a_0, J_0], ..., [a_{T-1}, J_{T-1}]} and, on this data sample set, perform a policy search using a policy gradient optimization algorithm

End of the loop over n

Output: the optimized policy π
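For illustration only, the flow above could be sketched in software as follows, reusing the hypothetical helpers sketched earlier (env_model_step, policy, reinforce_update) with their remaining arguments bound beforehand (e.g. via functools.partial); this is an assumed software analogue, not the hardware implementation.

```python
import numpy as np

def optimize_policy(env_step, policy_fn, update_fn, W_pi, s0, cost_fn, N=50, T=20, seed=0):
    """Model-based policy optimization loop following the exemplary flow above."""
    rng = np.random.default_rng(seed)
    for _ in range(N):
        s = np.array(s0, dtype=float)             # initialize state s_0
        costs, samples, states = [], [], []
        for _ in range(T):
            a = policy_fn(s, W_pi)                # a_{t-1} = pi(s_{t-1}; W_pi)
            states.append(s)
            s = env_step(s, a, rng)               # s_t = f(s_{t-1}, a_{t-1}; W, eps)
            costs.append(cost_fn(s))              # c_t = c(s_t)
            J = float(np.sum(costs))              # accumulated cost as the optimization cost
            samples.append((a, J))                # record [a_{t-1}, J_{t-1}]
        W_pi = update_fn(W_pi, samples, states)   # policy-gradient search on the sample set
    return W_pi                                   # optimized policy parameters
```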

FIG. 4 shows a schematic diagram of an example of the policy optimization method provided by at least one embodiment of the present disclosure.

As shown in FIG. 4, in an exemplary application, a boat is driven against the waves so as to get as close as possible to a target position on the coastline, and a control model for driving the boat therefore needs to be trained. The boat located at position (x, y) can choose an action (a_x, a_y) that represents the direction and magnitude of the drive. However, because of the dynamic environment of the sea surface with waves, the subsequent position of the boat exhibits drift and disturbance, and the closer the position is to the coast, the greater the disturbance. The boat is only given a limited batch data set of spatial position transitions and cannot optimize its action policy by interacting directly with the marine environment, so as to ensure safety. In this case, it is necessary to rely on the empirical data to learn a marine environment model (a dynamic environment model) capable of predicting the next state. The epistemic uncertainty and the aleatoric uncertainty arise, respectively, from the missing information about unvisited positions and from the randomness of the marine environment.

In this embodiment, for the control model, the sea surface is a dynamic environment, and the target policy refers to the method used in solving the boat's movement from its current position to the target position. First, a dynamic environment model for this dynamic environment and an initial target policy are acquired. The initial state of the boat is its current position; the action to be executed at the current time is obtained from the target policy; the dynamic environment model is used to predict the state at the next time (the position of the boat); the cost and the optimization cost corresponding to the target policy are calculated; and the data sample consisting of the action and the optimization cost is recorded. Assuming the current time is time 1, for time 1 to a subsequent time T, a data sample set is obtained, and a policy search is performed on the data sample set using a policy gradient optimization algorithm, thereby obtaining an optimized target policy.
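For this boat example, a natural choice of cost is simply the distance to the target position; the snippet below is an assumed illustration (including the target coordinates), not the cost function actually used in the disclosure.

```python
import numpy as np

TARGET = np.array([10.0, 0.0])    # assumed target position on the coastline

def boat_cost(s):
    # c(s): Euclidean distance from the boat position (x, y) to the target
    return float(np.linalg.norm(np.asarray(s)[:2] - TARGET))
```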

图5示出了本公开至少一实施例提供的一种利用基于忆阻器阵列的动态环境模型的策略优化装置500的示意框图,该策略优化装置可以用于执行图1A所示的数据处理方法。FIG. 5 shows a schematic block diagram of a strategy optimization apparatus 500 using a dynamic environment model based on a memristor array provided by at least one embodiment of the present disclosure. The strategy optimization apparatus can be used to execute the data processing method shown in FIG. 1A . .

如图5所示,策略优化装置500包括获取单元501、计算单元502以及策略搜索单元503。As shown in FIG. 5 , the strategy optimization apparatus 500 includes an acquisition unit 501 , a calculation unit 502 and a strategy search unit 503 .

获取单元501被配置为获取基于忆阻器阵列的动态环境模型。The acquisition unit 501 is configured to acquire a dynamic environment model based on the memristor array.

The computing unit 502 is configured to perform predictions for a plurality of times according to the dynamic environment model and an object policy, to obtain a data sample set including the optimization costs of the object policy corresponding to the plurality of times.

The policy search unit 503 is configured to perform, based on the data sample set, a policy search using a policy gradient optimization algorithm to optimize the object policy.
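The division of labor among the three units can be pictured with the following minimal sketch (illustrative only; the class name, callable interfaces, and iteration count are assumptions, not the disclosure's implementation):

```python
class PolicyOptimizer:
    """Sketch of the division of labor among units 501-503 (interfaces are assumptions)."""

    def __init__(self, acquire_model, compute_samples, search_policy):
        self.acquire_model = acquire_model        # acquisition unit 501
        self.compute_samples = compute_samples    # computing unit 502
        self.search_policy = search_policy        # policy search unit 503

    def optimize(self, policy, s0, T, iterations=10):
        model = self.acquire_model()                                   # memristor-array-based model
        for _ in range(iterations):
            samples = self.compute_samples(model, policy, s0, T)       # predictions for times 1..T
            policy = self.search_policy(policy, samples)               # policy-gradient search
        return policy
```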

For example, the policy optimization apparatus 500 may be implemented in hardware, software, firmware, or any feasible combination thereof, which is not limited by the present disclosure.

The technical effects of the above policy optimization apparatus are the same as those of the policy optimization method shown in FIG. 1A and are not repeated here.

The following points should be noted:

(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in those embodiments; for other structures, reference may be made to common designs.

(2) Where no conflict arises, the embodiments of the present disclosure and the features within the embodiments may be combined with one another to obtain new embodiments.

The above are merely specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto; the protection scope of the present disclosure shall be defined by the appended claims.

Claims (10)

1. A policy optimization method using a memristor-array-based dynamic environment model, comprising:
acquiring the dynamic environment model based on the memristor array;
performing predictions for a plurality of times according to the dynamic environment model and an object policy, to obtain a data sample set including optimization costs of the object policy corresponding to the plurality of times; and
performing, based on the data sample set, a policy search using a policy gradient optimization algorithm to optimize the object policy.
2. The policy optimization method according to claim 1, wherein acquiring the dynamic environment model comprises:
acquiring a Bayesian neural network, wherein the Bayesian neural network has a weight matrix obtained by training;
obtaining a plurality of corresponding target conductance values according to the weight matrix of the Bayesian neural network, and mapping the plurality of target conductance values onto the memristor array; and
inputting a state of a dynamic system at time t and a latent input variable, as input signals, into the weight-mapped memristor array, processing the state at time t and the latent input variable by the memristor array according to the Bayesian neural network, and acquiring from the memristor array an output signal corresponding to the processing result, wherein the output signal is used to obtain a prediction result of the dynamic system at time t+1.
3. The policy optimization method according to claim 2, wherein the dynamic environment model is expressed as s_{t+1} = f(s_t, a_t; W, ε),
wherein s_t is the state of the dynamic system at time t, a_t is the action of the object policy at time t, W is the weight matrix of the Bayesian neural network, ε is additive noise corresponding to the memristor array, and s_{t+1} is the prediction result of the dynamic system at time t+1; and
wherein the action of the object policy at time t is a_t = π(s_t; W_π), π denotes the function of the object policy, W_π denotes the policy parameters, the weight matrix W of the Bayesian neural network follows the distribution W ~ q(W), and the additive noise ε is additive Gaussian noise ε ~ N(0, σ²).
4. The policy optimization method according to claim 3, wherein the plurality of times comprises time 1 to time T arranged in order from earliest to latest, and
performing the predictions for the plurality of times according to the dynamic environment model and the object policy, to obtain the data sample set including the optimization costs of the object policy corresponding to the plurality of times, comprises:
for any time t−1 from time 1 to time T, obtaining the action a_{t−1} to be executed from the object policy, where a_{t−1} = π(s_{t−1}; W_π);
computing, according to the dynamic environment model s_t = f(s_{t−1}, a_{t−1}; W, ε), the state s_t at the next time t after time t−1, and obtaining the cost c_t corresponding to the state s_t at time t, thereby obtaining the cost sequence {c_1, c_2, ..., c_t} from time 1 to time t;
obtaining, based on the cost sequence, the optimization cost J_{t−1} at time t, where 1 ≤ t ≤ T; and
obtaining the data sample set {[a_0, J_0], ..., [a_{T−1}, J_{T−1}]} for time 1 to time T.
5. The policy optimization method according to claim 4, wherein, with E[c_t] denoting the expected value of the cost c_t at time t, the optimization cost at time t is obtained by the formula given in the original claim (formula image FDA0003633458550000021, not reproduced here).
6. The policy optimization method according to claim 4, wherein the cost further comprises a cost variation caused by aleatoric uncertainty and a cost variation caused by epistemic uncertainty,
wherein the aleatoric uncertainty is caused by the latent input variable, and the epistemic uncertainty is caused by the intrinsic noise of the memristor array.
7. The policy optimization method according to claim 6, wherein the optimization cost at time t is obtained by the formula given in the original claim (formula image FDA0003633458550000022, not reproduced here), wherein σ(η, θ) is a function of the aleatoric uncertainty and the epistemic uncertainty, η denotes the aleatoric uncertainty, and θ denotes the epistemic uncertainty.
8. The policy optimization method according to claim 4, wherein computing the state s_t at time t according to the dynamic environment model s_t = f(s_{t−1}, a_{t−1}; W, ε) and obtaining the cost c_t corresponding to the state s_t at time t comprises:
sampling the latent input variable z from the distribution p(z) to obtain a sample;
inputting the sample and the state s_{t−1} at time t−1 into the weight-mapped memristor array to obtain the predicted state s_t; and
obtaining, for the predicted state s_t, the cost c_t = c(s_t).
9. The policy optimization method according to claim 1, wherein the policy gradient optimization algorithm comprises the REINFORCE algorithm, the PPO algorithm, or the TRPO algorithm.
10. A policy optimization apparatus using a memristor-array-based dynamic environment model, comprising:
an acquisition unit configured to acquire the dynamic environment model based on the memristor array;
a computing unit configured to perform predictions for a plurality of times according to the dynamic environment model and an object policy, to obtain a data sample set including optimization costs of the object policy corresponding to the plurality of times; and
a policy search unit configured to perform, based on the data sample set, a policy search using a policy gradient optimization algorithm to optimize the object policy.
CN202210497721.2A 2022-05-09 2022-05-09 Strategy optimization method and device using environment model based on memristor array Active CN114819093B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210497721.2A CN114819093B (en) 2022-05-09 2022-05-09 Strategy optimization method and device using environment model based on memristor array
PCT/CN2023/092475 WO2023217027A1 (en) 2022-05-09 2023-05-06 Policy optimization method and apparatus using environment model based on memristor array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210497721.2A CN114819093B (en) 2022-05-09 2022-05-09 Strategy optimization method and device using environment model based on memristor array

Publications (2)

Publication Number Publication Date
CN114819093A true CN114819093A (en) 2022-07-29
CN114819093B CN114819093B (en) 2024-11-01

Family

ID=82512800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210497721.2A Active CN114819093B (en) 2022-05-09 2022-05-09 Strategy optimization method and device using environment model based on memristor array

Country Status (2)

Country Link
CN (1) CN114819093B (en)
WO (1) WO2023217027A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543827B (en) * 2018-12-02 2020-12-29 清华大学 Generative confrontation network device and training method
US11681903B2 (en) * 2019-10-31 2023-06-20 Micron Technology, Inc. Spike detection in memristor crossbar array implementations of spiking neural networks
CN113505887B (en) * 2021-09-12 2022-01-04 浙江大学 A Memristor Memory Neural Network Training Method for Memristor Errors
CN114067157B (en) * 2021-11-17 2024-03-26 中国人民解放军国防科技大学 Memristor-based neural network optimization method and device and memristor array
CN114819093B (en) * 2022-05-09 2024-11-01 清华大学 Strategy optimization method and device using environment model based on memristor array

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956256A (en) * 2019-12-09 2020-04-03 清华大学 Method and device for realizing Bayes neural network by using memristor intrinsic noise
CN113077829A (en) * 2021-04-20 2021-07-06 清华大学 Memristor array-based data processing method and electronic device
CN113269315A (en) * 2021-06-29 2021-08-17 安徽寒武纪信息科技有限公司 Apparatus, method and readable storage medium for performing task using deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHENGWU LIU et al.: "Multichannel parallel processing of neural signals in memristor arrays", APPLIED SCIENCES AND ENGINEERING, 9 October 2020 (2020-10-09) *
XIAO JIAN; ZHANG LIANG; ZHANG ZIHENG; WANG YU; GUO YUFENG; WAN XIANG; LIAN XIAOJUAN; TONG YI: "A voltage-mode neural network circuit based on multi-state memristors", Microelectronics, no. 03 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023217027A1 (en) * 2022-05-09 2023-11-16 清华大学 Policy optimization method and apparatus using environment model based on memristor array
CN116300477A (en) * 2023-05-19 2023-06-23 江西金域医学检验实验室有限公司 Method, system, electronic equipment and storage medium for regulating and controlling environment of enclosed space

Also Published As

Publication number Publication date
WO2023217027A1 (en) 2023-11-16
CN114819093B (en) 2024-11-01

Similar Documents

Publication Publication Date Title
US11348002B2 (en) Training of artificial neural networks
CN108009640B (en) Memristor-based neural network training device and training method
US10708522B2 (en) Image sensor with analog sample and hold circuit control for analog neural networks
CN110807519B (en) Parallel acceleration method, processor, and device for memristor-based neural network
US10740671B2 (en) Convolutional neural networks using resistive processing unit array
Fouda et al. Spiking neural networks for inference and learning: A memristor-based design perspective
US9779355B1 (en) Back propagation gates and storage capacitor for neural networks
US11087204B2 (en) Resistive processing unit with multiple weight readers
US11531898B2 (en) Training of artificial neural networks
CN112183739A (en) Hardware architecture of memristor-based low-power-consumption pulse convolution neural network
CN111406265A (en) Neural network circuit with non-volatile synapse array
CN108268938B (en) Neural network, information processing method and information processing system thereof
US20230113627A1 (en) Electronic device and method of operating the same
US12050997B2 (en) Row-by-row convolutional neural network mapping for analog artificial intelligence network training
WO2023217027A1 (en) Policy optimization method and apparatus using environment model based on memristor array
CN114819128B (en) Variational reasoning method and device for Bayesian neural network based on memristor array
KR20230005309A (en) Efficient Tile Mapping for Row-by-Row Convolutional Neural Network Mapping for Analog Artificial Intelligence Network Inference
WO2023217021A1 (en) Data processing method based on memristor array, and data processing apparatus
KR20210143614A (en) Neuromorphic device for implementing neural network and method for thereof
CN116523011B (en) Memristor-based binary neural network layer circuit and binary neural network training method
Qiu et al. Neuromorphic acceleration for context aware text image recognition
Irmanova et al. Discrete‐level memristive circuits for HTM‐based spatiotemporal data classification system
CN113826122B (en) Training of artificial neural networks
CN117808062A (en) Computing device, electronic device, and operating method for computing device
CN115796250A (en) Weight deployment method and device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant