
WO2024224501A1 - Action determination method and action determination device - Google Patents

Action determination method and action determination device Download PDF

Info

Publication number
WO2024224501A1
WO2024224501A1 (PCT application PCT/JP2023/016404)
Authority
WO
WIPO (PCT)
Prior art keywords
action
agent
calculation
real environment
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/JP2023/016404
Other languages
French (fr)
Japanese (ja)
Inventor
舞 竹内
海図 浅井
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TDK Corp
Original Assignee
TDK Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TDK Corp filed Critical TDK Corp
Priority to PCT/JP2023/016404
Publication of WO2024224501A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Definitions

  • The present invention relates to a behavior decision-making method and a behavior decision-making device.
  • Machine learning can be broadly divided into supervised learning, unsupervised learning, and reinforcement learning. Reinforcement learning is said to be suitable for learning autonomous driving and behavior optimization, and is attracting attention.
  • Reinforcement learning is a learning method in which an agent executes a task by repeatedly interacting with the environment and through trial and error. Interaction refers to the sending and receiving of information between the agent and the environment.
  • In reinforcement learning, the agent and the environment send and receive information on state, action, and reward.
  • The state is the situation the agent is in, and the action is the agent's behavior.
  • The reward is an evaluation index for the agent's state after the action.
  • The agent inputs actions based on the policy to the environment.
  • The environment inputs the state and reward to the agent according to the agent's actions.
  • The agent and the environment repeatedly interact, through trial and error, so as to maximize the reward.
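  • The interaction loop described above can be sketched as follows (a minimal illustration; the names `policy` and `env_step` and the toy signatures are assumptions, not part of the disclosure):

```python
def run_episode(policy, env_step, initial_state, n_steps=10):
    """Agent-environment interaction loop: the agent sends an action
    chosen by its policy to the environment; the environment returns
    the next state and a reward; the cumulative reward is what the
    agent ultimately tries to maximize."""
    state = initial_state
    total_reward = 0.0
    for _ in range(n_steps):
        action = policy(state)                   # agent -> environment
        state, reward = env_step(state, action)  # environment -> agent
        total_reward += reward
    return total_reward
```

Between episodes, a learning step would update `policy` so that the cumulative reward grows.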
  • Patent Document 1 discloses a reinforcement learning method that applies optimization calculations to the reward function.
  • In the reinforcement learning method described in Patent Document 1, the policy is expressed by a probability distribution function π(a|s) = exp(r_a(s))/Z_R, with the partition function Z_R as the denominator.
  • The agent decides on an action according to the policy. If the policy is specified by a probability distribution, the randomness of the action decision is uniquely determined.
  • When the real environment does not change, there is no problem in deciding actions according to the learned policy. When the real environment changes, however, actions in the post-change environment are still decided based on the policy learned in the pre-change environment, and the validity of those action decisions may be reduced. The reinforcement learning method described in Patent Document 1 therefore needs to re-learn every time the real environment changes, and cannot adequately respond to changes in the real environment.
  • The present invention has been made in consideration of the above circumstances. Its aim is to provide a behavior decision-making method and device that, by changing the randomness of behavior decisions in response to changes in the real environment, raise the probability of selecting actions other than the one that appeared optimal during learning, and thereby raise the possibility of selecting an action suited to the changed environment.
  • the present invention provides the following means.
  • The behavior decision method according to the first aspect includes a behavior decision step, a state change step, and a learning step.
  • In the behavior decision step, an optimization calculation is performed based on a policy represented by a model applicable to the optimization calculation, and an agent's behavior is decided.
  • In the state change step, the behavior is input into a real environment, causing an interaction between the real environment and the agent and changing the state of the agent.
  • In the learning step, the value of the agent's state after the behavior is obtained as a reward, and the next policy is decided based on the reward.
  • The randomness of the optimization calculation can be changed by a calculation parameter. The calculation parameter is changed based on the amount of change in the real environment between the learning step and the behavior decision step.
  • The behavior decision device according to the second aspect has a learning unit, a first control unit, and a second control unit.
  • The first control unit executes an optimization calculation based on a policy represented by a model applicable to the optimization calculation, and decides an agent's behavior.
  • The second control unit inputs the behavior into a real environment and changes the state of the agent.
  • The learning unit obtains the state of the agent after the behavior as a reward, and decides the next policy based on the reward.
  • The randomness of the optimization calculation can be changed by a calculation parameter.
  • The calculation parameter is changed based on the amount of change in the real environment between when the agent's behavior is decided and when the policy is decided.
  • The behavior decision-making method and behavior decision-making device of the present invention can adapt to environmental changes by changing the randomness of behavior decisions in response to changes in the environment.
  • FIG. 1 is a conceptual diagram of reinforcement learning that is a component of the behavior decision-making method according to the first embodiment.
  • FIG. 2 is a conceptual diagram showing state changes in reinforcement learning that is a part of the behavior decision-making method according to the first embodiment.
  • FIG. 3 is an image diagram of the probability distribution of solutions obtained by optimization calculations using quantum annealing.
  • FIG. 4 is an image diagram of the probability distribution of actions selected by expressing the policy as an optimization calculation model and performing optimization calculations.
  • FIG. 5 is an image diagram showing the relationship between the action selected by an agent based on a policy and changes in the real environment.
  • FIG. 6 is an example of a flow diagram of reinforcement learning according to the first embodiment.
  • FIG. 7 is a block diagram of the behavior determination device according to the first embodiment.
  • FIG. 1 is a conceptual diagram of reinforcement learning that carries out the behavior decision-making method according to the first embodiment.
  • In the reinforcement learning of FIG. 1, an agent 1 and a real environment 2 learn by interacting with each other, and a reward r is obtained according to a state S_a of the agent 1 after an action a.
  • The action a of the agent 1 is determined according to a policy π, and the agent 1 performs the determined action a.
  • The action a of the agent 1 can be represented, for example, by an action vector (a_1, a_2, a_3, ..., a_n).
  • The state S_a of the agent 1 changes depending on the action a.
  • The state S_a of the agent 1 can be represented, for example, by a state vector (S_a0, S_a1, S_a2, ..., S_an).
  • For example, the agent 1 is the control unit of a robot arm.
  • In this case, the real environment 2 is the environment in which the robot arm operates.
  • The real environment 2 is, for example, the temperature and humidity at which the robot arm operates, the wear state of parts, the material of the floor on which the robot arm operates, the movement of adjacent robot arms, etc.
  • For example, when the robot arm is driven by currents through n conductors, the action a of the agent 1 corresponds to the action of passing a current through the conductors.
  • One action a of the robot arm is determined by specifying how many amperes of current should be passed through each of the n conductors.
  • a_1 is the choice of how many amperes to pass through the first conductor,
  • and a_2 is the choice of how many amperes to pass through the second conductor.
  • In general, there are K_l options for the current through the l-th conductor: any one of the currents {I_l^1, I_l^2, ..., I_l^Kl} can be applied to the l-th conductor.
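  • Under these assumptions, the discrete action space is the Cartesian product of the per-conductor current options; a short sketch (the ampere values below are invented purely for illustration):

```python
from itertools import product

# Hypothetical current options {I_l^1, ..., I_l^Kl} in amperes for
# each of n = 3 conductors; the actual sets are application-specific.
current_options = [
    [0.0, 0.5, 1.0],   # K_1 = 3 options for the first conductor
    [0.0, 1.0],        # K_2 = 2 options for the second conductor
    [0.2, 0.4, 0.6],   # K_3 = 3 options for the third conductor
]

# One action a = (a_1, ..., a_n) fixes a current for every conductor,
# so the action space has K_1 * K_2 * ... * K_n candidate actions.
actions = list(product(*current_options))
```

The optimization calculation then amounts to picking one element of `actions` according to the policy.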
  • The policy π that determines the behavior of the agent 1 is, for example, a guideline for controlling the robot arm. Based on this guideline (policy π), the control of the robot arm (action a) is determined.
  • As a result of the action a, the state S_a of the agent 1 changes.
  • The state S_a of the agent 1 is, for example, the angle of a joint of the robot arm. For example, when a current is applied to each of the n conductors connected to the robot arm, the angle of the joint of the robot arm changes, and the state S_a changes.
  • The reward r is, for example, a value obtained by the change in the state S_a of the robot arm. For example, if the goal of precise control of the robot arm is to bend the arm by 30 degrees, then the closer the state S_a caused by the action a of the agent 1 is to a state bent by 30 degrees, the higher the reward r.
  • FIG. 2 is a conceptual diagram showing state changes in reinforcement learning, which is the behavior decision-making method according to the first embodiment.
  • In reinforcement learning, the state of the agent 1 transitions as learning proceeds.
  • Each of S_a0, S_a1, S_a2, S_a3, and S_a4 in FIG. 2 represents a state of the agent 1.
  • Starting from the initial state S_a0, an action is selected by applying an optimization calculation to the policy π, and the agent transitions to, for example, any one of S_a0, S_a1, S_a2, and S_a3 according to the result. For example, in the case of control of a robot arm, the agent transitions from the initial state S_a0, in which the joints are not bent, to any one of the next states S_a0, S_a1, S_a2, and S_a3.
  • The next state may remain the original state S_a0, in which the joints are not bent, or may be any one of the other states S_a1, S_a2, and S_a3, in which the joints are bent.
  • The state to which the agent transitions is determined based on the policy π. For example, after transitioning to the state S_a2, in which the joints are bent, the agent transitions to another state S_a0, S_a3, or S_a4 depending on the next action.
  • In reinforcement learning, for example, a reward r is calculated for each action a that produces a state transition. Alternatively, the reward r may be calculated after selecting and executing actions multiple times according to the same policy (for each episode).
  • the policy ⁇ is represented by a model that can be calculated by optimization. Then, an action a is determined based on the policy ⁇ by performing an optimization calculation. The randomness of the optimization calculation can be changed by calculation parameters.
  • the optimization calculations may be performed, for example, using adiabatic quantum computing performed on a quantum computer, quantum annealing, or a genetic algorithm.
  • an Ising model or QUBO can be used as a model applicable to the optimization calculation.
  • any Hermitian matrix can be used as a model applicable to the optimization calculation.
  • any real-valued function can be used as a model applicable to the optimization calculation.
  • For example, the policy π is expressed by the Ising model.
  • In this case, the policy π is expressed by the following equation (1):
  •   π(a | S_a) = Σ_{i,j} h_ij σ_i σ_j   ... (1)
  • π(a | S_a) is an optimization calculation model, and represents the relationship between the elements of the behavior vector a.
  • ⁇ i and ⁇ j are input variables, and each have two values, +1 or -1.
  • h ij is an interaction parameter.
  • h ij is expressed, for example, as a function of the state S a of agent 1.
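  • As a toy illustration of how an action can be read off such a model, the following exhaustively minimizes a small Ising energy over all spin assignments (an annealer would search this space physically instead of enumerating it; the coupling values in the test are invented for the example):

```python
from itertools import product

def ising_energy(spins, couplings):
    """Energy of a spin configuration under pairwise couplings h_ij:
    E = sum over pairs (i, j) of h_ij * sigma_i * sigma_j,
    with each sigma taking the value +1 or -1."""
    return sum(w * spins[i] * spins[j] for (i, j), w in couplings.items())

def best_spins(couplings, n):
    """Brute-force ground-state search over all 2^n assignments of +/-1.
    Quantum annealing replaces this enumeration with physical dynamics."""
    return min(product([-1, 1], repeat=n),
               key=lambda s: ising_energy(s, couplings))
```

With couplings h_ij derived from the state S_a, the minimizing spin vector encodes the selected action a.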
  • In the optimization calculation, the probability of finding the true optimal value can vary depending on the calculation time. For example, the optimal value found after performing the calculation for a nearly infinite time will not necessarily be the same as the optimal value found after performing the calculation for a short time.
  • FIG. 3 is an image diagram of the probability distribution of a solution obtained by an optimization calculation using quantum annealing.
  • Part (a) of FIG. 3 is the probability distribution of the solution output after performing the optimization calculation for a long time,
  • and part (b) is the probability distribution of the solution output after performing the optimization calculation for a short time.
  • In FIG. 3, the true optimal value is v_0.
  • The probability distribution is expressed, for example, by a normal distribution.
  • Figure 4 shows an image of the probability distribution of action a selected by performing optimization calculations, with the policy ⁇ represented as an optimization calculation model.
  • Figure 4 (a) shows the probability distribution of action a selected after performing optimization calculations based on the policy ⁇ for a long period of time, and (b) shows the probability distribution of action a selected after performing optimization calculations based on the policy ⁇ for a short period of time.
  • The probability of selecting an action a_1 other than the optimal action a_0 changes depending on the calculation time of the optimization calculation. For example, the longer the calculation time, the lower the probability of selecting an action a_1 other than the optimal action a_0; the shorter the calculation time, the higher that probability.
  • Changing the calculation time in quantum annealing can change the randomness of the optimization calculation.
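  • A classical simulated-annealing stand-in shows the same effect (assumed here purely for illustration; the linear cooling schedule and Metropolis acceptance rule are textbook choices, not the disclosure's): a shorter run leaves more randomness in the returned solution.

```python
import math
import random

def anneal(energy, n_vars, n_steps, seed=0):
    """Minimize energy() over +/-1 variables with a linear cooling
    schedule. Fewer steps means a higher temperature per move on
    average, hence a broader, more random distribution of results."""
    rng = random.Random(seed)
    spins = [rng.choice([-1, 1]) for _ in range(n_vars)]
    for step in range(n_steps):
        temp = max(1e-3, 1.0 - step / n_steps)  # cools from 1.0 toward 0
        i = rng.randrange(n_vars)
        before = energy(spins)
        spins[i] = -spins[i]                    # propose a single flip
        delta = energy(spins) - before
        if delta > 0 and rng.random() >= math.exp(-delta / temp):
            spins[i] = -spins[i]                # reject the uphill move
    return spins
```

Here `n_steps` plays the role of the calculation-time parameter: a long schedule concentrates the output on the true optimum, a short one does not.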
  • the optimal action may change when the real environment 2 changes. For example, even if the action a0 was optimal in the real environment 2 before the change, the action a0 ' may become optimal when the real environment 2 changes.
  • the randomness of the optimization calculation is changed based on the amount of change in the real environment.
  • the randomness of the optimization calculation can be changed by changing the calculation parameters.
  • In quantum annealing, the calculation time of the optimization calculation is one of the calculation parameters.
  • In a genetic algorithm, the probability of mutation occurring is one of the calculation parameters that contribute to the randomness of the optimization calculation.
  • The randomness of the optimization calculation can be changed by changing the mutation probability.
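  • For the genetic-algorithm case, the mutation probability plays this role directly; a minimal sketch (bit-string genomes and the helper name `mutate` are assumptions of the example):

```python
import random

def mutate(genome, p_mut, rng):
    """Flip each bit of the genome independently with probability
    p_mut. Raising p_mut increases the randomness of the search
    (more exploration); lowering it keeps the search close to the
    current best candidates."""
    return [1 - g if rng.random() < p_mut else g for g in genome]
```

Setting `p_mut` from the measured environmental change therefore tunes how far the search wanders from the previously optimal action.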
  • In adiabatic quantum computing, noise such as heat applied to the computing device is one of the calculation parameters that contribute to randomness.
  • FIG. 6 is an example of a flow diagram of reinforcement learning according to the first embodiment.
  • the behavior decision method according to the first embodiment has a behavior decision step S1, a state change step S2, and a learning step S3.
  • In the behavior decision step S1, an optimization calculation is performed based on a policy π represented by a model applicable to the optimization calculation, and an action a of the agent 1 is determined.
  • In the state change step S2, the action a is input to the real environment 2, whereby the real environment 2 and the agent 1 interact with each other, changing the state of the agent 1.
  • In the learning step S3, the value of the state S_a of the agent 1 after the action a is obtained as a reward r, and the next policy π is determined based on the reward r.
  • the action decision step S1 includes, for example, an environmental change measurement step S11, a calculation parameter setting step S12, and an optimization calculation step S13.
  • In the environmental change measurement step S11, the real environment 2 at the time of the action decision (action decision step S1) is measured, and the amount of change in each parameter relative to the real environment 2 at the time of learning (learning step S3) is obtained.
  • For example, the amounts of change in the temperature and humidity at which the robot arm operates, the wear condition of parts, the material of the floor on which the robot arm operates, and the movement of an adjacent robot arm are measured.
  • the measurement of the real environment 2 is performed, for example, by a sensor, etc.
  • Each parameter to be measured is set in advance.
  • The amount of environmental change is found, for example, by adding up the amounts of change in the measured parameters.
  • In the summation, the parameter with the highest importance may be multiplied by a weighting coefficient.
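  • The weighted sum described above can be written as follows (the parameter names and weights are illustrative, not from the disclosure):

```python
def environment_change(before, after, weights=None):
    """Aggregate environmental change: sum over the measured
    parameters of |after - before|, with an optional importance
    coefficient per parameter (defaulting to 1.0)."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * abs(after[name] - before[name])
               for name in before)
```

`before` holds the measurements from the learning step S3 and `after` those from the action decision step S1.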
  • In the calculation parameter setting step S12, the calculation parameters are set based on the amount of environmental change found in the environmental change measurement step S11.
  • the calculation parameters differ depending on the method of optimization calculation. For example, when the optimization calculation is performed using quantum annealing, the calculation time of quantum annealing is one of the calculation parameters. In another example, when the optimization calculation is performed using a genetic algorithm, the probability of mutation is one of the calculation parameters. In another example, when the optimization calculation is performed using adiabatic quantum computing, noise such as heat added to the calculation is one of the calculation parameters.
  • For example, the amount of environmental change may be input to a sigmoid function and converted into a numerical value (calculation parameter) between 0 and 1.
  • This type of conversion makes it possible to reduce the amount of change in the calculation parameter when the amount of environmental change is small, and to increase the amount of change in the calculation parameter when the amount of environmental change is large.
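  • One way to realize this conversion is the following (the gain and midpoint are tuning assumptions, not values from the disclosure):

```python
import math

def change_to_parameter(delta, gain=1.0, midpoint=0.0):
    """Squash the environmental change delta into (0, 1) with a
    sigmoid. Choosing a positive midpoint keeps the parameter small
    for small changes and lets it saturate toward 1 for large ones."""
    return 1.0 / (1.0 + math.exp(-gain * (delta - midpoint)))
```

The resulting value in (0, 1) can then be scaled onto whichever calculation parameter the chosen optimization method exposes.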
  • When the optimization calculation is performed using adiabatic quantum computing, for example, the greater the amount of environmental change, the more noise sources are added.
  • In the optimization calculation step S13, the optimization calculation is performed according to the calculation parameters determined in the calculation parameter setting step S12, and an action a is determined. As shown in FIG. 4, the action a selected in the optimization calculation step S13 is not limited to the truly optimal action a_0.
  • That is, in the optimization calculation step S13, it is possible to select an action a_1 other than the action a_0 that is truly optimal in a given real environment 2.
  • The smaller the randomness set by the calculation parameters, the lower the probability of selecting an action a_1 other than the truly optimal action a_0; the greater the randomness, the higher that probability.
  • the state change step S2 includes, for example, a behavior step S21 and an operation step S22.
  • In the behavior step S21, the action a determined according to the policy π is input to the agent 1 in the real environment 2.
  • For example, the action a is input to the robot arm by applying a current to each of the conductors that control the movement of the robot arm.
  • The current applied to each conductor is determined based on the optimization calculation.
  • The policy π is represented by an optimization calculation model.
  • In the operation step S22, the agent 1 acts in accordance with the action a determined in the behavior step S21.
  • For example, a current is applied to each of the conductors that control the movement of the robot arm, causing the robot arm to act.
  • the action of the agent 1 changes the state S a of the agent 1.
  • the action of the robot arm changes the state of the robot arm (e.g., the angle of the joint).
  • the interaction between the real environment 2 and the agent 1 changes the state S a of the agent 1.
  • The learning step S3 includes, for example, a real environment measurement step S31, a reward calculation step S32, and a policy determination step S33.
  • In the real environment measurement step S31, the real environment 2 at the time of the learning step S3 is measured.
  • The real environment 2 at the time of the learning step S3 does not necessarily match the real environment 2 at the time of the action decision step S1 described above.
  • By measuring the real environment 2 in this step, the amount of change in the real environment 2 between the learning step S3 and the action decision step S1 can be obtained, and the calculation parameters can be set.
  • The real environment measurement step S31 may be performed after the reward calculation step S32.
  • In the reward calculation step S32, a reward r for the state S_a of the agent 1 after the action a is obtained.
  • In the policy determination step S33, the next policy π is decided based on the reward r.
  • That is, the result of the previous action is reflected, and the model of the optimization calculation representing the policy π is remade.
  • The policy π represents the relationship between the actions and how each affects the state S_a of the agent 1. If the reward r is high, the previous policy π can be said to have been an appropriate model, and a similar model is created. If the reward r is low, the relationship represented by the previous policy π can be said to have been inappropriate, and a new model that takes the previous action into account is created.
  • The reinforcement learning according to the first embodiment repeats the action decision step S1, the state change step S2, and the learning step S3, learning so as to maximize the reward.
  • The environmental change measurement step S11, the calculation parameter setting step S12, and the real environment measurement step S31 do not have to be performed every time these steps are repeated.
  • For example, they may be performed once every several repetitions, or they may be performed at random.
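  • The overall repetition of steps S1 to S3, with the measurement sub-steps run only periodically, can be skeletonized as follows (all callables and the `measure_every` schedule are placeholders):

```python
def train(decide_action, apply_action, learn, n_iters, measure_every=5):
    """Repeat action decision (S1), state change (S2), and learning (S3).
    The environment measurement and parameter setting sub-steps
    (S11, S12, S31) are refreshed only every `measure_every` iterations,
    as the text above permits."""
    rewards = []
    for it in range(n_iters):
        measure = (it % measure_every == 0)   # S11/S12/S31 run sparsely
        action = decide_action(measure)       # S1: optimization calculation
        state = apply_action(action)          # S2: interact with environment
        rewards.append(learn(state))          # S3: reward and next policy
    return rewards
```

A random schedule could replace the modulo test without changing the structure of the loop.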
  • The behavior decision-making method according to the first embodiment can also adapt to changes in the real environment 2.
  • When the policy π is expressed as a fixed probability distribution and actions are determined probabilistically according to it, the policy does not change even if the environment changes, and neither does the probability with which each action is selected. Therefore, once reinforcement learning has been performed in a given real environment 2 and a policy π has been fixed, the way actions are determined does not change even if the environment changes.
  • Such a policy π can select the optimal action a before the real environment 2 changes, but there is no guarantee that this action a remains optimal after the real environment 2 changes.
  • In this case, the action a is selected probabilistically according to the probability distribution learned in the real environment 2 before the change.
  • The probability that a_0 or a_0' will be selected does not change from the time the policy was decided.
  • As a result, the agent 1 is more likely to select an action a that is inappropriate in the real environment 2 after the change.
  • In contrast, in the behavior decision-making method according to the first embodiment, the policy π is represented by a model applicable to the optimization calculation.
  • This model is set in the real environment 2 during learning, and the same policy π can be used even if the real environment 2 changes.
  • The behavior decision based on the policy π is performed by the optimization calculation.
  • The probability distribution of the behavior a obtained as a result of the optimization calculation changes depending on the calculation parameters.
  • The real environment 2 at the time of deciding an action does not necessarily match the real environment 2 at the time the policy π was obtained by learning. Nevertheless, the action is decided based on the policy π learned in the earlier real environment 2.
  • In the action decision method according to this embodiment, changing the randomness of the action decision increases the probability of selecting an action other than the action a_0 that was optimal in the pre-change real environment 2, making it possible to respond to changes in the real environment 2. When this method is used, it therefore becomes easier to select an action adapted to the new real environment 2, and learning can proceed while making use of the progress made so far.
  • FIG. 7 is a block diagram of the behavior decision device 10 according to the first embodiment.
  • the behavior decision device 10 includes, for example, a learning unit 11, a first control unit 12, a second control unit 13, a memory 14, a measurement unit 15, and a transmission/reception unit 16.
  • the learning unit 11 acquires the state S a of the agent 1 after the action a as a reward r, and determines the next policy ⁇ based on the reward r.
  • the learning unit 11 performs learning so as to maximize the reward r.
  • the learning unit 11 acquires, for example, the action a and the state S a of the agent 1 as a result of the action a from the memory 14.
  • the learning unit 11 obtains the reward r based on the acquired action a and state S a .
  • the learning unit 11 determines the policy ⁇ based on the reward r.
  • the learning unit 11 has, for example, a computing unit (CPU).
  • the learning unit 11 performs, for example, learning step S3.
  • The first control unit 12 executes an optimization calculation based on a policy represented by a model applicable to the optimization calculation, and determines the behavior of the agent.
  • The first control unit 12 performs, for example, the behavior determination step S1.
  • The policy π determined by the learning unit 11 is transmitted to the first control unit 12.
  • The first control unit 12 sets the calculation parameters based on the amount of change in the real environment 2 measured by the measurement unit 15.
  • The first control unit 12 then executes the optimization calculation with the set calculation parameters and determines the action a of the agent 1.
  • the first control unit 12 has, for example, a computing unit (CPU).
  • the second control unit 13 inputs the action a into the real environment 2, and changes the state S a of the agent 1.
  • the second control unit 13 instructs the agent 1 to perform the action a.
  • the second control unit 13 instructs the agent 1 via, for example, the transmitting/receiving unit 16.
  • the second control unit 13 has, for example, a computing unit (CPU).
  • the memory 14 stores learning data, a program according to the behavior decision method, and information on environmental changes.
  • the learning data includes, for example, an agent's behavior a, a state S a of the agent 1 as a result of the agent's behavior a, and a reward r for the state S a .
  • the measurement unit 15 is, for example, a sensor.
  • the measurement unit 15 measures the real environment 2.
  • the measurement unit 15 is used, for example, when performing the real environment measurement step S31 and the environmental change amount measurement step S11.
  • the measurement unit 15 measures, for example, each parameter representing the real environment 2.
  • The transmitting/receiving unit 16 transmits the action a to the device, for example, according to an instruction from the second control unit 13. For example, the transmitting/receiving unit 16 transmits the action a based on the policy π to the control unit of the robot arm. The transmitting/receiving unit 16 also receives the state S_a of the agent 1 as a result of the action a.
  • The connection of the transmitting/receiving unit 16 may be wired or wireless.
  • The transmitting/receiving unit 16 is responsible for the state change step S2.
  • the reward r obtained in the reward calculation step S32 is stored in the memory 14, for example, via the transmitting/receiving unit 16.
  • the behavior decision-making device 10 operates according to the behavior decision-making method described above, and is therefore also capable of responding to changes in the real environment 2.


Abstract

This action determination method includes an action determination step, a state change step, and a learning step. In the action determination step, an optimization calculation is executed on the basis of a policy represented by a model applicable to the optimization calculation, and the action of an agent is determined. In the state change step, the action is input into an actual environment, whereby the actual environment and the agent interact, and the state of the agent changes. In the learning step, a value for the state of the agent after the action is acquired as a reward, and the next policy is determined on the basis of the reward. The optimization calculation can modify randomness by using a calculation parameter. The calculation parameter is modified on the basis of a change amount of the actual environment between the learning step and the action determination step.

Description

行動決定方法及び行動決定装置Behavior determination method and behavior determination device

 本発明は、行動決定方法及び行動決定装置に関する。 The present invention relates to a behavior decision-making method and a behavior decision-making device.

 機械学習は、教師あり学習と、教師なし学習と、強化学習とに、大別できる。強化学習は、自動運転、動作の最適化等の学習に適していると言われており、注目されている。 Machine learning can be broadly divided into supervised learning, unsupervised learning, and reinforcement learning. Reinforcement learning is said to be suitable for learning autonomous driving and behavior optimization, and is attracting attention.

 強化学習は、エージェントと環境とが相互作用を繰り返し、試行錯誤を行うことで、タスクを実行する学習方法である。相互作用は、エージェントと環境とが互いに情報を送受信し合うことをいう。 Reinforcement learning is a learning method in which an agent executes a task by repeatedly interacting with the environment and through trial and error. Interaction refers to the sending and receiving of information between the agent and the environment.

 強化学習において、エージェントと環境とは、状態、行動、報酬の情報を送受信し合う。状態は、エージェントが置かれている状況であり、行動は、エージェントの振る舞いである。報酬は、行動後のエージェントの状態に対する評価指標である。エージェントは、方策に基づいた行動を環境に入力する。環境は、エージェントの行動に応じて、状態と報酬をエージェントに入力する。エージェントと環境とは、報酬を最大限得られるように、相互作用を繰り返し試行錯誤する。 In reinforcement learning, the agent and the environment send and receive information on state, action, and reward. The state is the situation the agent is in, and the action is the agent's behavior. The reward is an evaluation index for the agent's state after the action. The agent inputs actions based on the policy to the environment. The environment inputs the state and reward to the agent according to the agent's actions. The agent and environment repeatedly interact through trial and error to maximize the reward.

 例えば、特許文献1には、報酬関数に最適化計算を適用した強化学習方法が開示されている。 For example, Patent Document 1 discloses a reinforcement learning method that applies optimization calculations to the reward function.

特許第7111178号公報Patent No. 7111178

 特許文献1に記載の強化学習方法では、分配関数ZRを分母とした確率分布関数(π(a|s)=exp(r(s))/Z)で方策を表現している。エージェントは方策に従って行動を決定する。方策を確率分布で規定すると、行動決定のランダム性が一意に決まってしまう。実環境が変化しない場合は、学習された方策に従って行動を決定しても問題ないが、実環境が変化する場合は、変化前の実環境で学習された方策を基に行動が決定されてしまう。この場合、変化後の実環境での行動決定を、変化前の実環境で学習された方策に基づいて行うことになり、行動決定の妥当性が低くなる場合がある。そのため、特許文献1に記載の強化学習方法は、実環境が変化する毎に学習をし直す必要があり、実環境の変化に十分対応することができない。 In the reinforcement learning method described in Patent Document 1, the policy is expressed by a probability distribution function (π(a|s)=exp(r a (s))/Z R ) with the distribution function Z R as the denominator. The agent decides on an action according to the policy. If the policy is specified by a probability distribution, the randomness of the action decision is uniquely determined. When the real environment does not change, there is no problem in deciding the action according to the learned policy, but when the real environment changes, the action is decided based on the policy learned in the real environment before the change. In this case, the action decision in the real environment after the change is made based on the policy learned in the real environment before the change, and the validity of the action decision may be reduced. Therefore, the reinforcement learning method described in Patent Document 1 needs to re-learn every time the real environment changes, and cannot adequately respond to changes in the real environment.

 The present invention has been made in view of the above circumstances. Its object is to provide a behavior decision method and a behavior decision device that, by changing the randomness of action decisions in response to changes in the real environment, raise the probability of selecting actions that are not truly optimal at learning time and thereby raise the possibility of selecting an action suited to a changed environment.

 To solve the above problems, the present invention provides the following means.

 A behavior decision method according to a first aspect includes a behavior decision step, a state change step, and a learning step. In the behavior decision step, an optimization calculation is executed based on a policy expressed by a model applicable to optimization calculation, and an action of an agent is decided. In the state change step, the action is input into a real environment, the real environment and the agent interact, and the state of the agent changes. In the learning step, the value of the agent's state after the action is obtained as a reward, and the next policy is decided based on the reward. The randomness of the optimization calculation can be changed by a calculation parameter, and the calculation parameter is changed based on the amount of change in the real environment between the learning step and the behavior decision step.

 A behavior decision device according to a second aspect includes a learning unit, a first control unit, and a second control unit. The first control unit executes an optimization calculation based on a policy expressed by a model applicable to optimization calculation, and decides an action of an agent. The second control unit inputs the action into a real environment and changes the state of the agent. The learning unit obtains the state of the agent after the action as a reward, and decides the next policy based on the reward. The randomness of the optimization calculation can be changed by a calculation parameter, and the calculation parameter is changed based on the amount of change in the real environment between when the agent's action is decided and when the policy is decided.

 The behavior decision method and behavior decision device according to the present invention can adapt to environmental changes by changing the randomness of action decisions in response to changes in the environment.

FIG. 1 is a conceptual diagram of the reinforcement learning underlying the behavior decision method according to the first embodiment.
FIG. 2 is a conceptual diagram showing state changes in the reinforcement learning underlying the behavior decision method according to the first embodiment.
FIG. 3 is an illustration of the probability distribution of solutions obtained by an optimization calculation using quantum annealing.
FIG. 4 is an illustration of the probability distribution of actions selected by expressing the policy as an optimization calculation model and performing the optimization calculation.
FIG. 5 is an illustration showing the relationship between the actions an agent selects based on a policy and changes in the real environment.
FIG. 6 is an example of a flow diagram of the reinforcement learning according to the first embodiment.
FIG. 7 is a block diagram of the behavior decision device according to the first embodiment.

 The present embodiment will be described in detail below with reference to the drawings as appropriate. For ease of understanding, the drawings used in the following description may show characteristic parts enlarged, and the dimensional ratios of components may differ from the actual ones. The materials, dimensions, and the like exemplified below are merely examples; the present invention is not limited to them and can be modified as appropriate without changing its gist.

 FIG. 1 is a conceptual diagram of the reinforcement learning underlying the behavior decision method according to the first embodiment. In reinforcement learning, an agent 1 and a real environment 2 learn by interacting with each other, and a reward r is obtained according to the state S_a of the agent 1 after an action a. The action a of the agent 1 is decided according to a policy π, and the agent 1 performs the decided action. The action a can be represented, for example, by an action vector (a_1, a_2, a_3, ..., a_n). The state S_a of the agent 1 changes according to the action a and can be represented, for example, by a state vector (S_a0, S_a1, S_a2, ..., S_an).

 A concrete explanation follows, taking the control of a robot arm as an example. In this case, the agent 1 is the control unit of the robot arm, and the real environment 2 is the environment in which the robot arm operates: for example, the temperature and humidity at which the arm operates, the wear state of its parts, the material of the floor on which it operates, and the movement of adjacent robot arms.

 For example, if the robot arm operates by passing currents through n conductors connected to it, the action a of the agent 1 corresponds to the operation of passing those currents. Specifying how many amperes to pass through each of the n conductors determines one action a of the robot arm.

 Each element of the action vector (a_1, a_2, a_3, ..., a_n) corresponds to the choice of how many amperes to pass through the l-th conductor (l = an integer from 1 to n). For example, a_1 is the choice for the first conductor and a_2 the choice for the second. If there are K_l options for the current through the l-th conductor, then one of the currents {I_l^1, I_l^2, ..., I_l^{K_l}} is applied to that conductor.
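 This choice structure can be encoded for an optimization solver. A minimal sketch, assuming the one-hot representation mentioned later in this description, in which each candidate current of each conductor gets one binary variable (function names are illustrative, not from the publication):

```python
def one_hot_action(choices, num_options):
    """choices[l] = index of the current chosen for conductor l (0-based).
    num_options[l] = K_l, the number of candidate currents for conductor l.
    Returns a flat binary vector of length N = K_1 + ... + K_n."""
    bits = []
    for chosen, num in zip(choices, num_options):
        vec = [0] * num
        vec[chosen] = 1  # exactly one bit set per conductor
        bits.extend(vec)
    return bits

# Example: two conductors with 3 and 2 candidate currents;
# choose the 2nd current for conductor 1 and the 1st for conductor 2.
encoded = one_hot_action([1, 0], [3, 2])
```

One action then corresponds to one valid assignment of the N binary variables, with exactly one bit set per conductor.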

 The policy π that decides the agent 1's action is, for example, a guideline for controlling the robot arm. Based on this guideline (the policy π), the control of the robot arm (the action a) is decided.

 When the agent 1 performs an action a, its state S_a changes. The state S_a is, for example, the set of joint angles of the robot arm: applying a current to each of the n conductors connected to the arm changes the joint angles, and thus the state S_a.

 The reward r is, for example, the value obtained by changing the state S_a of the robot arm. If correct control means bending the arm by 30 degrees, then the closer the state S_a produced by the action a is to a 30-degree bend, the higher the reward r.

 FIG. 2 is a conceptual diagram showing state changes in the reinforcement learning underlying the behavior decision method according to the first embodiment. In reinforcement learning, the state of the agent 1 transitions through learning.

 Each of S_a0, S_a1, S_a2, S_a3, and S_a4 in FIG. 2 represents a state of the agent 1. Starting from the initial state S_a0, an action is selected by applying an optimization calculation to the policy π, and the agent transitions according to the result to, for example, one of S_a0, S_a1, S_a2, and S_a3. In the robot arm example, the arm transitions from the initial state S_a0 with no joints bent to one of the next states S_a0, S_a1, S_a2, and S_a3: the next state may remain the unbent state S_a0, or be one of the bent states S_a1, S_a2, and S_a3. Which state is reached is decided based on the policy π. If the arm transitions to the bent state S_a2, the next action takes it to one of the states S_a0, S_a3, and S_a4. In reinforcement learning, a reward r may be obtained, for example, for each action a that produces a state transition, or after actions have been selected and executed multiple times under the same policy (per episode).

 In the present embodiment, the policy π is expressed by a model to which an optimization calculation can be applied, and an action a is decided based on the policy π by performing the optimization calculation. The randomness of the optimization calculation can be changed by a calculation parameter. As described in detail later, deciding the action a via an optimization calculation whose randomness is tunable raises the possibility of selecting an action suited to the environment even when the real environment 2 has changed (fluctuated) since the learning that decided the policy π.

 The optimization calculation may be performed, for example, by adiabatic quantum computation on a quantum computer, by quantum annealing, or by a genetic algorithm.

 For example, when the optimization calculation is performed by quantum annealing, an Ising model or a QUBO can be used as the model applicable to the optimization calculation. When it is performed by adiabatic quantum computation, an arbitrary Hermitian matrix can be used as such a model; when it is performed by a genetic algorithm, an arbitrary real-valued function can be used.

 For example, when the policy π is expressed by an Ising model, it is represented by the following formula (1).

π(a|S_a) = Σ_{i,j=1}^{N} h_ij σ_i σ_j   ... (1)

 Here, π(a|S_a) is the optimization calculation model and represents the relationships among the elements of the action vector a. σ_i and σ_j are input variables, each taking one of the two values +1 or -1. h_ij is an interaction parameter, expressed, for example, as a function of the state S_a of the agent 1. When the options are expressed in one-hot representation, N = K_1 + K_2 + ... + K_n, where K_l is the number of options for the amount of current that can be applied to the l-th conductor (l = a natural number from 1 to n).
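 The Ising form above can be sketched in code. A minimal illustration of evaluating the model's energy, assuming the couplings h_ij are supplied as a plain dict (in the method they would depend on the state S_a; that dependence is omitted here for brevity):

```python
def ising_energy(sigma, h):
    """Energy of a spin configuration sigma (entries +1/-1) under couplings
    h = {(i, j): h_ij}. Minimizing this energy over configurations is what
    the optimization calculation (e.g. annealing) does to select an action."""
    return sum(h_ij * sigma[i] * sigma[j] for (i, j), h_ij in h.items())

# Two-spin illustration: a negative coupling favours aligned spins, so the
# aligned configuration has lower energy and would be the preferred solution.
h = {(0, 1): -1.0}
e_aligned = ising_energy([+1, +1], h)
e_anti = ising_energy([+1, -1], h)
```

The annealer searches for the σ configuration minimizing this energy; the decoded configuration is the action a.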

 In an optimization calculation using quantum annealing, the probability of finding the true optimum can depend on the calculation time. For example, the optimal value obtained after computing for a nearly infinite time does not necessarily match the one obtained after a short computation.

 FIG. 3 is an illustration of the probability distribution of solutions obtained by an optimization calculation using quantum annealing. Part (a) of FIG. 3 shows the distribution of solutions output after a long optimization calculation, and part (b) after a short one. The true optimal value is v_0, and the distribution is expressed, for example, as a normal distribution.

 When an optimization calculation using quantum annealing runs for a long time, the probability of selecting the true optimal value v_0 increases, and the probability of outputting a value v_1 other than v_0 as the optimum decreases. Conversely, when the calculation time is short, the probability of selecting v_0 decreases and the probability of outputting some other value v_1 increases, because a short calculation time in quantum annealing is generally treated as noise. A short calculation may therefore output a value v_1 other than the true optimum v_0. In a general optimization calculation this is a problem, but the behavior decision method according to the present embodiment exploits this property.

 FIG. 4 is an illustration of the probability distribution of the action a selected by expressing the policy π as an optimization calculation model and performing the optimization calculation. Part (a) of FIG. 4 shows the distribution of actions selected after a long optimization calculation based on the policy π, and part (b) after a short one.

 When the policy π is expressed as an optimization calculation model and the action a is obtained by optimization, the probability of selecting an action a_1 other than the optimal action a_0 changes with the calculation time: the longer the calculation, the lower that probability; the shorter the calculation, the higher it becomes. Changing the calculation time of quantum annealing thus changes the randomness of the optimization calculation.
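 This time/randomness trade-off can be illustrated with a classical stand-in. A sketch, assuming simulated annealing as a software analogue of quantum annealing (the actual method would run on annealing hardware): with few steps the run often ends on a non-optimal state, while with many steps it concentrates on the optimum.

```python
import math
import random

def anneal(energies, n_steps, rng):
    """Toy annealing over states 0..len(energies)-1 with a linear cooling
    schedule; returns the index of the final state."""
    state = rng.randrange(len(energies))
    for t in range(1, n_steps + 1):
        cand = rng.randrange(len(energies))
        d_e = energies[cand] - energies[state]
        temp = max(1e-9, 1.0 - t / n_steps)  # cools to ~0 at the last step
        # Accept better moves always; worse moves with Boltzmann probability.
        if d_e <= 0 or rng.random() < math.exp(-d_e / temp):
            state = cand
    return state

def hit_rate(n_steps, trials=500, seed=0):
    """Fraction of runs ending on the true optimum (state 0)."""
    rng = random.Random(seed)
    energies = [0.0, 1.0, 1.0, 1.0]  # state 0 is the true optimum
    return sum(anneal(energies, n_steps, rng) == 0
               for _ in range(trials)) / trials
```

Comparing `hit_rate(60)` with `hit_rate(2)` shows the longer schedule finding the optimum far more often, mirroring FIG. 4(a) versus 4(b).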

 FIG. 5 is an illustration showing the relationship between the actions an agent selects based on a policy and changes in the real environment. When the real environment 2 changes, the optimal action may change as well: even if the action a_0 was optimal in the real environment 2 before the change, an action a_0' may become optimal after it.

 When the randomness of the optimization calculation is small (for example, when the calculation time is long), the probability of selecting an action other than the action a_0 considered optimal in a given real environment 2 is low. In that case, even after the real environment 2 changes, the agent is likely to keep selecting the action a_0 that was optimal before the change. In other words, the probability of selecting the action a_0' that is optimal in the changed environment is low, and the agent may never take it.

 Conversely, when the randomness of the optimization calculation is large (for example, when the calculation time is short), the probability of selecting an action other than a_0 is high. In that case, even when the real environment 2 changes, there is room to select actions other than the action a_0 that was optimal before the change. A high probability of selecting actions other than a_0 in the pre-change environment is equivalent to a high probability of selecting the action a_0' that is optimal in the post-change environment.

 In the behavior decision method according to the present embodiment, the randomness of the optimization calculation is varied based on the amount of change in the real environment, by changing a calculation parameter. As described above, when the optimization calculation is quantum annealing, the calculation time is one such calculation parameter.

 When the optimization calculation is performed by a genetic algorithm, the mutation probability is one of the calculation parameters that contribute to its randomness, and changing it changes the randomness of the calculation. When the optimization is performed by adiabatic quantum computation on a quantum computer, noise such as heat applied to the computing device is one of the parameters that contribute to randomness.
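 As an illustration of the mutation probability acting as the randomness knob, a minimal bit-flip mutation over a binary action encoding (names and the specific operator are illustrative assumptions, not taken from the publication):

```python
import random

def mutate(bits, p_mut, rng):
    """Flip each bit independently with probability p_mut."""
    return [1 - b if rng.random() < p_mut else b for b in bits]

rng = random.Random(42)
parent = [0] * 100
few_flips = sum(mutate(parent, 0.01, rng))   # small randomness: few flips
many_flips = sum(mutate(parent, 0.5, rng))   # large randomness: many flips
```

Raising `p_mut` widens the search around the current best action, which is exactly the effect wanted when the real environment has drifted.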

 FIG. 6 is an example of a flow diagram of the reinforcement learning according to the first embodiment. The behavior decision method according to the first embodiment includes a behavior decision step S1, a state change step S2, and a learning step S3.

 In the behavior decision step S1, an optimization calculation is executed based on the policy π expressed by a model applicable to optimization calculation, and the action a of the agent 1 is decided. In the state change step S2, the action a is input into the real environment 2, the real environment 2 and the agent 1 interact, and the state of the agent 1 changes. In the learning step S3, the value of the state S_a of the agent 1 after the action a is obtained as a reward r, and the next policy π is decided based on the reward r.

 The behavior decision step S1 includes, for example, an environment change measurement step S11, a calculation parameter setting step S12, and an optimization calculation step S13.

 In the environment change measurement step S11, the real environment 2 at the time of action decision (the behavior decision step S1) is measured, and the amount of change from each parameter of the real environment 2 at the time of learning (the learning step S3) is obtained. For example, the temperature and humidity at which the robot arm operates, the wear state of its parts, the material of the floor, and the movement of adjacent robot arms are measured, for instance with sensors. The parameters to be measured are set in advance. The environment change amount is obtained, for example, by summing the changes of the measured parameters; a coefficient may be applied to parameters of higher importance.
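 The change-amount computation described here can be sketched as follows; the parameter names, readings, and weighting scheme are illustrative assumptions:

```python
def environment_change(learned, current, weights=None):
    """learned / current: dicts of sensor readings at learning time and at
    action-decision time. weights: optional importance coefficients
    (default 1.0). Returns a single scalar change amount."""
    weights = weights or {}
    return sum(weights.get(k, 1.0) * abs(current[k] - learned[k])
               for k in learned)

# Hypothetical readings: +2 degrees of temperature, +0.05 of joint wear,
# with wear weighted 10x because it matters more for the robot arm.
learned = {"temperature": 25.0, "humidity": 40.0, "wear": 0.10}
current = {"temperature": 27.0, "humidity": 40.0, "wear": 0.15}
delta = environment_change(learned, current, weights={"wear": 10.0})
```

The resulting scalar `delta` is what step S12 converts into a calculation parameter.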

 In the calculation parameter setting step S12, a calculation parameter is set based on the environment change amount obtained in the environment change measurement step S11. The calculation parameter depends on the optimization method: when the optimization calculation is quantum annealing, the annealing calculation time is one such parameter; when it is a genetic algorithm, the mutation probability; when it is adiabatic quantum computation, noise such as heat added to the computation.

 For example, the environment change amount may be input to a sigmoid function and converted into a value between 0 and 1 (the calculation parameter). With such a conversion, a small environment change produces a small change in the calculation parameter, and a large environment change produces a large one.
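 A minimal sketch of the sigmoid conversion; the gain and offset values, and the suggested mapping onto annealing time, are illustrative assumptions:

```python
import math

def change_to_parameter(delta, gain=1.0, offset=0.0):
    """Squash an environment change amount into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(gain * delta - offset)))

# Larger change -> larger parameter. The parameter could then, for example,
# shorten the annealing time as t = t_max * (1 - parameter); that mapping
# is an assumption, not specified in the text.
p_small = change_to_parameter(0.0)
p_large = change_to_parameter(5.0)
```

Choosing `gain` and `offset` sets how sensitive the randomness is to environmental drift.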

 For example, when the optimization calculation is quantum annealing, the larger the environment change, the shorter the annealing calculation time is made. When it is a genetic algorithm, the larger the environment change, the higher the mutation probability is set. When it is adiabatic quantum computation, the larger the environment change, the more noise sources are added.

 In the optimization calculation step S13, the optimization calculation is performed according to the calculation parameter decided in the calculation parameter setting step S12, and the action a is decided. As shown in FIG. 4, the action a selected in step S13 is not limited to the truly optimal action a_0.

 In the optimization calculation step S13, an action a_1 other than the truly optimal action a_0 in the given real environment 2 may also be selected. The smaller the randomness of the calculation parameter, the lower the probability of selecting such an action a_1; the larger the randomness, the higher it becomes. Because step S13 can select actions a_1 other than the truly optimal a_0, the method can also adapt to changes in the real environment 2.

 The state change step S2 includes, for example, an action step S21 and an operation step S22.

 In the action step S21, the action a decided according to the policy π is input to the agent 1 in the real environment 2. For example, the action a is input to the robot arm by applying a current to each of the conductors that control its movement; each current is decided based on the optimization calculation, the policy π being expressed as an optimization calculation model.

 In the operation step S22, the agent 1 operates according to the action a decided in the action step S21. For example, applying the currents to the conductors that control the robot arm causes the arm to move. The operation changes the state S_a of the agent 1: the robot arm's state (for example, its joint angles) changes. The state S_a of the agent 1 changes through the interaction between the real environment 2 and the agent 1.

 The learning step S3 includes, for example, a real environment measurement step S31, a reward calculation step S32, and a policy decision step S33.

 In the real environment measurement step S31, the real environment 2 during the learning step S3 is measured. This environment does not necessarily match the real environment 2 during the behavior decision step S1 described above. Measuring the real environment 2 at the time the policy is decided makes it possible to obtain the amount of change in the real environment 2 between the learning step S3 and the behavior decision step S1, and thus to set the calculation parameter. The real environment measurement step S31 may also be performed after the reward calculation step S32.

 In the reward calculation step S32, a reward r for the state S_a of the agent 1 after the action a is obtained. The higher the reward r, the more appropriate the action a of the agent 1 was.

 In the policy decision step S33, the next policy is decided based on the reward r. For example, the optimization calculation model representing the policy π is rebuilt to reflect the result of the previous action. The policy π represents the relationships among actions (how each affects the state S_a of the agent 1). When the reward r is high, the previous policy π was an appropriate model and a similar model is created; when the reward r is low, the relationships it represented were not appropriate, and a new model is created that takes the previous action into account.
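 The publication does not give a concrete update rule. As a highly simplified, hypothetical sketch, a reward-weighted adjustment of the Ising couplings could look like this: the rule lowers the energy of the last spin configuration when the reward was high, making similar actions more likely under energy minimization.

```python
def update_couplings(h, sigma, reward, lr=0.1):
    """h: {(i, j): coupling}; sigma: last spin configuration (+1/-1 entries);
    reward: scalar evaluation of the resulting state; lr: learning rate.
    Hypothetical rule, not taken from the publication."""
    return {(i, j): h_ij - lr * reward * sigma[i] * sigma[j]
            for (i, j), h_ij in h.items()}

h = {(0, 1): 0.0}
# A positive reward for the configuration (+1, +1) lowers its energy
# (the coupling moves from 0.0 to -0.1), so energy minimization now
# prefers the action that just earned the reward.
h_next = update_couplings(h, [+1, +1], reward=1.0)
```

A zero reward leaves the model unchanged; a negative reward would raise the energy of the last configuration, pushing the search away from it.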

 The reinforcement learning according to the first embodiment repeats the behavior decision step S1, the state change step S2, and the learning step S3, learning so as to maximize value (maximize the reward). The environment change measurement step S11, the calculation parameter setting step S12, and the real environment measurement step S31 need not be performed on every repetition; for example, they may be performed once every several repetitions, or at random.

 The action determination method according to the first embodiment can also adapt to changes in the real environment 2.

 For example, suppose the policy π is expressed as a fixed probability distribution and actions are chosen probabilistically according to it. Then the policy does not change when the environment changes, and neither do the probabilities with which actions are chosen. Once reinforcement learning has been performed in a given real environment 2 and the policy π has been fixed, the way actions are determined does not change even if the environment changes. Such a policy π can select the optimal action a before the real environment 2 changes, but there is no guarantee that the same action a remains optimal after the real environment 2 has changed.

 Such a policy selects an action a probabilistically according to the probability distribution learned in the real environment 2 before the change, even if the real environment 2 has changed between learning and action determination. In other words, even if the change in the real environment 2 alters the truly optimal action (a0 → a0'), the probabilities of selecting a0 and a0' remain as they were when the policy was decided. As a result, the agent 1 is more likely to select an action a that is no longer appropriate in the changed real environment 2.

 For example, if the wear at the joints of a robot arm worsens, the arm no longer moves smoothly. Even if current is applied to each conductor of the robot arm according to the policy obtained in the real environment 2 before the wear, the robot arm may fail to reach the desired state in the real environment after the wear.

 In contrast, in the action determination method according to the present embodiment, the policy π is represented by a model applicable to an optimization calculation. This model is set up in the real environment 2 at learning time, and the same policy π can still be used after the real environment 2 changes. Action determination based on the policy π is performed by the optimization calculation, and the probability distribution of the action a obtained as its result changes with the calculation parameters. When the real environment 2 has not changed, the probability of selecting the truly optimal action a0 under the policy π can be kept high; when the real environment 2 has changed, the probability of selecting a0 under the policy π can be lowered and the probability of selecting other actions a1 can be raised.
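To make concrete how a calculation parameter changes the probability distribution of the action a, the sketch below uses a Boltzmann distribution over illustrative "energies" of the policy model — the kind of distribution that annealing-style optimization samples from. The energy values and the two temperatures are assumptions chosen only for illustration, not values from the embodiment.

```python
import math

def action_distribution(energies, temperature):
    # Lower energy = better action under the model representing policy pi.
    weights = [math.exp(-e / temperature) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

energies = [0.0, 0.3, 0.6]   # index 0 plays the role of a0, the model optimum

# Small calculation parameter: the optimization almost always returns a0.
cold = action_distribution(energies, 0.05)

# Large calculation parameter: other actions gain substantial probability.
hot = action_distribution(energies, 5.0)
```

With a temperature of 0.05 the first action takes nearly all of the probability mass, while at 5.0 the three actions are selected almost uniformly — matching the text's description of raising or lowering the probability that a0 is chosen.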

 For example, if the wear at the joints of the robot arm worsens, current can be applied to each conductor of the robot arm under conditions other than those found optimal in the real environment 2 before the wear. In the action determination method according to the present embodiment, for instance, a larger amount of current than the pre-wear optimum may be selected for each conductor of the robot arm in accordance with the policy π. Indeed, in the worn state after the real environment 2 has changed, applying a larger amount of current than the pre-wear optimum to each conductor of the robot arm is the appropriate choice.

 The real environment 2 at the time of action determination does not necessarily match the real environment 2 in which the policy π was learned. Nevertheless, the action is determined based on the policy π learned in the earlier real environment 2. In the action determination method according to the present embodiment, changing the randomness of the action determination raises the probability of selecting actions other than the action a0 that was optimal in the real environment 2 before the change, so the method can respond to changes in the real environment 2. It therefore becomes easier to select actions suited to the new real environment 2, and learning can proceed while making use of what has already been learned.
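A minimal way to realize "changing the randomness of the action determination" from the measured environment change is a schedule that maps the change amount to the calculation parameter. The linear ramp, the bounds, and the function name below are assumptions for illustration; the embodiment does not specify this mapping.

```python
def set_calculation_parameter(env_change, t_min=0.05, t_max=5.0, scale=10.0):
    """Map the measured amount of change in the real environment 2 to a
    temperature-like calculation parameter: no change keeps the optimization
    nearly deterministic (exploit the learned policy pi), while a large
    change makes the action selection more random (explore around a0)."""
    return min(t_max, t_min + scale * abs(env_change))
```

`set_calculation_parameter(0.0)` returns the nearly deterministic lower bound, and the parameter grows with the measured change until it saturates, so actions other than the previously optimal a0 become progressively more likely to be selected.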

 FIG. 7 is a block diagram of the action determination device 10 according to the first embodiment. The action determination device 10 includes, for example, a learning unit 11, a first control unit 12, a second control unit 13, a memory 14, a measurement unit 15, and a transmitting/receiving unit 16.

 The learning unit 11 acquires the state Sa of the agent 1 after the action a as a reward r, and decides the next policy π based on the reward r. The learning unit 11 performs learning so as to maximize the reward r. For example, the learning unit 11 obtains the action a and the resulting state Sa of the agent 1 from the memory 14, computes the reward r from them, and decides the policy π based on the reward r.

 The learning unit 11 includes, for example, an arithmetic unit (CPU). The learning unit 11 performs, for example, the learning step S3.

 The first control unit 12 executes the optimization calculation based on a policy represented by a model applicable to the optimization calculation, and determines the action of the agent. The first control unit 12 performs, for example, the action determination step S1.

 The policy π decided by the learning unit 11 is passed to the first control unit 12. The first control unit 12 sets the calculation parameters based on the amount of change in the real environment 2 measured by the measurement unit 15, then executes the optimization calculation with the set calculation parameters and determines the action a of the agent 1. The first control unit 12 includes, for example, an arithmetic unit (CPU).

 The second control unit 13 inputs the action a into the real environment 2 and changes the state Sa of the agent 1. The second control unit 13 instructs the agent 1 to perform the action a, for example via the transmitting/receiving unit 16. The second control unit 13 includes, for example, an arithmetic unit (CPU).

 The memory 14 stores learning data, a program implementing the action determination method, and information on environmental changes. The learning data include, for example, the action a of the agent, the state Sa of the agent 1 resulting from the action a, and the reward r for the state Sa.

 The measurement unit 15 is, for example, a sensor. The measurement unit 15 measures the real environment 2 and is used, for example, in the real environment measurement step S31 and the environmental change measurement step S11. The measurement unit 15 measures, for example, the parameters representing the real environment 2.

 The transmitting/receiving unit 16 conveys the action a to the device, for example according to instructions from the second control unit 13. For example, the transmitting/receiving unit 16 transmits the action a based on the policy π to the controller of the robot arm, and receives the state Sa of the agent 1 resulting from the action a. The transmitting/receiving unit 16 may be wired or wireless, and handles the state change step S2. The reward r obtained in the reward calculation step S32 is stored in the memory 14, for example via the transmitting/receiving unit 16.

 Because the action determination device 10 according to the first embodiment operates according to the action determination method described above, it can likewise respond to changes in the real environment 2.
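The division of roles among the units of FIG. 7 might be mirrored in code roughly as follows. The class and method names, the epsilon-style use of the calculation parameter, and the plain list standing in for the real environment 2 are all illustrative assumptions, not the embodiment's implementation.

```python
import random

class ActionDeterminationDevice:
    """Sketch of FIG. 7: learning unit (11), first control unit (12),
    second control unit (13), and memory (14). The measurement and
    transmitting/receiving units (15, 16) are omitted for brevity."""

    def __init__(self, n_actions, seed=0):
        self.rng = random.Random(seed)
        self.memory = []                    # 14: stores (action, state) pairs
        self.policy = [0.0] * n_actions     # model representing policy pi

    def decide_action(self, calc_param):
        # 12: first control unit — the calculation parameter sets how often
        # the "optimization" returns a random action instead of the best one.
        if self.rng.random() < min(1.0, calc_param):
            return self.rng.randrange(len(self.policy))
        return max(range(len(self.policy)), key=self.policy.__getitem__)

    def act(self, action, env):
        # 13: second control unit — input the action into the environment
        # and record the resulting state in the memory.
        state = env[action]
        self.memory.append((action, state))
        return state

    def learn(self, action, reward, lr=0.5):
        # 11: learning unit — decide the next policy model from the reward.
        self.policy[action] += lr * (reward - self.policy[action])
```

Driving the device for a number of steps — decide, act, learn — with a moderate calculation parameter steers the policy toward the higher-reward action while the memory accumulates the history used for learning.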

 Embodiments of the present invention have been described above in detail with reference to the drawings, but the configurations in each embodiment and their combinations are merely examples; configurations may be added, omitted, replaced, or otherwise modified without departing from the spirit of the present invention.

REFERENCE SIGNS LIST

1 Agent
2 Real environment
10 Action determination device
11 Learning unit
12 First control unit
13 Second control unit
14 Memory
15 Measurement unit
16 Transmitting/receiving unit
a Action
r Reward
π Policy
Sa State
S1 Action determination step
S11 Environmental change measurement step
S12 Calculation parameter setting step
S13 Optimization calculation step
S2 State change step
S21 Action step
S22 Operation step
S3 Learning step
S31 Real environment measurement step
S32 Reward calculation step
S33 Policy decision step

Claims (5)

1. An action determination method comprising:
an action determination step of executing an optimization calculation based on a policy represented by a model applicable to the optimization calculation, and determining an action of an agent;
a state change step of inputting the action into a real environment, whereby the real environment and the agent interact and a state of the agent changes; and
a learning step of acquiring a value of the state of the agent after the action as a reward and determining a next policy based on the reward,
wherein a randomness of the optimization calculation is changeable by a calculation parameter, and
the calculation parameter is changed based on an amount of change in the real environment between the learning step and the action determination step.
2. The action determination method according to claim 1, wherein the optimization calculation is an adiabatic quantum calculation performed on a quantum computer.

3. The action determination method according to claim 1, wherein the optimization calculation is quantum annealing, and the model is an Ising model or a QUBO model.
4. The action determination method according to claim 1, wherein the optimization calculation is performed using a genetic algorithm.

5. An action determination device comprising a learning unit, a first control unit, and a second control unit, wherein
the first control unit executes an optimization calculation based on a policy represented by a model applicable to the optimization calculation, and determines an action of an agent;
the second control unit inputs the action into a real environment and changes a state of the agent;
the learning unit acquires a state of the agent after the action as a reward and determines a next policy based on the reward;
a randomness of the optimization calculation is changeable by a calculation parameter; and
the calculation parameter is changed based on an amount of change in the real environment between when the action of the agent is determined and when the policy is determined.


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2386987A1 (en) * 2010-04-20 2011-11-16 Alcatel Lucent A method of reinforcement learning, corresponding computer program product, and data storage device therefor
JP2019005809A (en) * 2017-06-20 2019-01-17 リンカーン グローバル,インコーポレイテッド Machine learning for weldment classification and correlation
JP2022522180A (en) * 2020-01-10 2022-04-14 テンセント・テクノロジー・(シェンジェン)・カンパニー・リミテッド Insulation development path prediction methods, equipment, equipment and computer programs


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MIZOUE HIROYUKI; KOBAYASHI KUNIKAZU; KUREMOTO TAKASHI; OBAYASHI MASANAO: "A Meta-Parameter Learning Method in Reinforcement Learning Based on Temporal Difference Error", IEEJ Transactions on Electronics, Information and Systems, vol. 129, no. 9, 2009, pp. 1730-1736, ISSN 0385-4221, DOI: 10.1541/ieejeiss.129.1730 *


Legal Events

121: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 23935281; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)