
WO2024224501A1 - Action determination method and action determination device - Google Patents

Action determination method and action determination device Download PDF

Info

Publication number
WO2024224501A1
WO2024224501A1 (PCT application PCT/JP2023/016404)
Authority
WO
WIPO (PCT)
Prior art keywords
action
agent
calculation
real environment
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/JP2023/016404
Other languages
French (fr)
Japanese (ja)
Inventor
舞 竹内
海図 浅井
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TDK Corp
Original Assignee
TDK Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TDK Corp filed Critical TDK Corp
Priority to PCT/JP2023/016404
Publication of WO2024224501A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Definitions

  • The present invention relates to a behavior decision-making method and a behavior decision-making device.
  • Machine learning can be broadly divided into supervised learning, unsupervised learning, and reinforcement learning. Reinforcement learning is said to be suitable for learning autonomous driving and behavior optimization, and is attracting attention.
  • Reinforcement learning is a learning method in which an agent executes a task by repeatedly interacting with the environment and through trial and error. Interaction refers to the sending and receiving of information between the agent and the environment.
  • In reinforcement learning, the agent and the environment send and receive information on state, action, and reward.
  • The state is the situation the agent is in, and the action is the agent's behavior.
  • The reward is an evaluation index for the agent's state after the action.
  • The agent inputs actions based on the policy to the environment.
  • The environment inputs the state and reward to the agent according to the agent's actions.
  • The agent and the environment repeatedly interact, through trial and error, so as to maximize the reward.
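  • The interaction loop described above can be sketched as follows (a minimal illustration; the names `policy` and `env_step` and the toy signatures are assumptions, not part of the disclosure):

```python
def run_episode(policy, env_step, initial_state, n_steps=10):
    """Agent-environment interaction loop: the agent sends an action
    chosen by its policy to the environment; the environment returns
    the next state and a reward; the cumulative reward is what the
    agent ultimately tries to maximize."""
    state = initial_state
    total_reward = 0.0
    for _ in range(n_steps):
        action = policy(state)                   # agent -> environment
        state, reward = env_step(state, action)  # environment -> agent
        total_reward += reward
    return total_reward
```

Between episodes, a learning step would update `policy` so that the cumulative reward grows.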
  • Patent Document 1 discloses a reinforcement learning method that applies optimization calculations to the reward function.
  • In the reinforcement learning method described in Patent Document 1, the policy is expressed by a probability distribution function π(a|s) = exp(r_a(s))/Z_R, with the partition function Z_R as the denominator.
  • The agent decides on an action according to the policy. If the policy is specified by a probability distribution, the randomness of the action decision is uniquely determined.
  • When the real environment does not change, there is no problem in deciding actions according to the learned policy. When the real environment changes, however, actions in the post-change environment are still decided based on the policy learned in the pre-change environment, and the validity of those action decisions may be reduced. The reinforcement learning method described in Patent Document 1 therefore needs to re-learn every time the real environment changes, and cannot adequately respond to changes in the real environment.
  • The present invention has been made in consideration of the above circumstances. Its aim is to provide a behavior decision-making method and device that, by changing the randomness of behavior decisions in response to changes in the real environment, raise the probability of selecting actions other than the one that appeared optimal during learning, and thereby raise the possibility of selecting an action suited to the changed environment.
  • the present invention provides the following means.
  • The behavior decision method according to the first aspect includes a behavior decision step, a state change step, and a learning step.
  • In the behavior decision step, an optimization calculation is performed based on a policy represented by a model applicable to the optimization calculation, and an agent's behavior is decided.
  • In the state change step, the behavior is input into a real environment, causing an interaction between the real environment and the agent and changing the state of the agent.
  • In the learning step, the value of the agent's state after the behavior is obtained as a reward, and the next policy is decided based on the reward.
  • The randomness of the optimization calculation can be changed by a calculation parameter. The calculation parameter is changed based on the amount of change in the real environment between the learning step and the behavior decision step.
  • The behavior decision device according to the second aspect has a learning unit, a first control unit, and a second control unit.
  • The first control unit executes an optimization calculation based on a policy represented by a model applicable to the optimization calculation, and decides an agent's behavior.
  • The second control unit inputs the behavior into a real environment and changes the state of the agent.
  • The learning unit obtains the state of the agent after the behavior as a reward, and decides the next policy based on the reward.
  • The randomness of the optimization calculation can be changed by a calculation parameter.
  • The calculation parameter is changed based on the amount of change in the real environment between when the agent's behavior is decided and when the policy is decided.
  • The behavior decision-making method and behavior decision-making device of the present invention can adapt to environmental changes by changing the randomness of behavior decisions in response to changes in the environment.
  • FIG. 1 is a conceptual diagram of reinforcement learning that is a component of the behavior decision-making method according to the first embodiment.
  • FIG. 2 is a conceptual diagram showing state changes in reinforcement learning that is a part of the behavior decision-making method according to the first embodiment.
  • FIG. 3 is an image diagram of the probability distribution of solutions obtained by optimization calculations using quantum annealing.
  • FIG. 4 is an image diagram of the probability distribution of actions selected by expressing the policy as an optimization calculation model and performing optimization calculations.
  • FIG. 5 is an image diagram showing the relationship between the action selected by an agent based on a policy and changes in the real environment.
  • FIG. 6 is an example of a flow diagram of reinforcement learning according to the first embodiment.
  • FIG. 7 is a block diagram of the behavior determination device according to the first embodiment.
  • FIG. 1 is a conceptual diagram of reinforcement learning that carries out the behavior decision-making method according to the first embodiment.
  • In the reinforcement learning of FIG. 1, an agent 1 and a real environment 2 learn by interacting with each other, and a reward r is obtained according to a state S_a of the agent 1 after an action a.
  • The action a of the agent 1 is determined according to a policy π, and the agent 1 performs the determined action a.
  • The action a of the agent 1 can be represented, for example, by an action vector (a_1, a_2, a_3, ..., a_n).
  • The state S_a of the agent 1 changes depending on the action a.
  • The state S_a of the agent 1 can be represented, for example, by a state vector (S_a0, S_a1, S_a2, ..., S_an).
  • For example, the agent 1 is the control unit of a robot arm.
  • In this case, the real environment 2 is the environment in which the robot arm operates.
  • The real environment 2 is, for example, the temperature and humidity at which the robot arm operates, the wear state of parts, the material of the floor on which the robot arm operates, the movement of adjacent robot arms, etc.
  • For example, when the robot arm is driven by currents through n conductors, the action a of the agent 1 corresponds to the action of passing a current through the conductors.
  • One action a of the robot arm is determined by specifying how many amperes of current should be passed through each of the n conductors.
  • a_1 is the choice of how many amperes to pass through the first conductor,
  • and a_2 is the choice of how many amperes to pass through the second conductor.
  • In general, there are K_l options for the current through the l-th conductor: any one of the currents {I_l^1, I_l^2, ..., I_l^Kl} can be applied to the l-th conductor.
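  • Under these assumptions, the discrete action space is the Cartesian product of the per-conductor current options; a short sketch (the ampere values below are invented purely for illustration):

```python
from itertools import product

# Hypothetical current options {I_l^1, ..., I_l^Kl} in amperes for
# each of n = 3 conductors; the actual sets are application-specific.
current_options = [
    [0.0, 0.5, 1.0],   # K_1 = 3 options for the first conductor
    [0.0, 1.0],        # K_2 = 2 options for the second conductor
    [0.2, 0.4, 0.6],   # K_3 = 3 options for the third conductor
]

# One action a = (a_1, ..., a_n) fixes a current for every conductor,
# so the action space has K_1 * K_2 * ... * K_n candidate actions.
actions = list(product(*current_options))
```

The optimization calculation then amounts to picking one element of `actions` according to the policy.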
  • The policy π that determines the behavior of the agent 1 is, for example, a guideline for controlling the robot arm. Based on this guideline (policy π), the control of the robot arm (action a) is determined.
  • As a result of the action a, the state S_a of the agent 1 changes.
  • The state S_a of the agent 1 is, for example, the angle of a joint of the robot arm. For example, when a current is applied to each of the n conductors connected to the robot arm, the angle of the joint of the robot arm changes, and the state S_a changes.
  • The reward r is, for example, a value obtained by the change in the state S_a of the robot arm. For example, if the goal of precise control of the robot arm is to bend the arm by 30 degrees, then the closer the state S_a caused by the action a of the agent 1 is to a state bent by 30 degrees, the higher the reward r.
  • FIG. 2 is a conceptual diagram showing state changes in reinforcement learning, which is the behavior decision-making method according to the first embodiment.
  • In reinforcement learning, the state of the agent 1 transitions as learning proceeds.
  • Each of S_a0, S_a1, S_a2, S_a3, and S_a4 in FIG. 2 represents a state of the agent 1.
  • Starting from the initial state S_a0, an action is selected by applying an optimization calculation to the policy π, and the agent transitions to, for example, any one of S_a0, S_a1, S_a2, and S_a3 according to the result. For example, in the case of control of a robot arm, the agent transitions from the initial state S_a0, in which the joints are not bent, to any one of the next states S_a0, S_a1, S_a2, and S_a3.
  • The next state may remain the original state S_a0, in which the joints are not bent, or may be any one of the other states S_a1, S_a2, and S_a3, in which the joints are bent.
  • The state to which the agent transitions is determined based on the policy π. For example, after transitioning to the state S_a2, in which the joints are bent, the agent transitions to another state S_a0, S_a3, or S_a4 depending on the next action.
  • In reinforcement learning, for example, a reward r is calculated for each action a that produces a state transition. Alternatively, the reward r may be calculated after selecting and executing actions multiple times according to the same policy (for each episode).
  • the policy ⁇ is represented by a model that can be calculated by optimization. Then, an action a is determined based on the policy ⁇ by performing an optimization calculation. The randomness of the optimization calculation can be changed by calculation parameters.
  • the optimization calculations may be performed, for example, using adiabatic quantum computing performed on a quantum computer, quantum annealing, or a genetic algorithm.
  • an Ising model or QUBO can be used as a model applicable to the optimization calculation.
  • any Hermitian matrix can be used as a model applicable to the optimization calculation.
  • any real-valued function can be used as a model applicable to the optimization calculation.
  • For example, the policy π is expressed by the Ising model.
  • In this case, the policy π is expressed by the following equation (1):
  •   π(a | S_a) = Σ_{i,j} h_ij σ_i σ_j   ... (1)
  • π(a | S_a) is an optimization calculation model, and represents the relationship between the elements of the behavior vector a.
  • ⁇ i and ⁇ j are input variables, and each have two values, +1 or -1.
  • h ij is an interaction parameter.
  • h ij is expressed, for example, as a function of the state S a of agent 1.
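  • As a toy illustration of how an action can be read off such a model, the following exhaustively minimizes a small Ising energy over all spin assignments (an annealer would search this space physically instead of enumerating it; the coupling values in the test are invented for the example):

```python
from itertools import product

def ising_energy(spins, couplings):
    """Energy of a spin configuration under pairwise couplings h_ij:
    E = sum over pairs (i, j) of h_ij * sigma_i * sigma_j,
    with each sigma taking the value +1 or -1."""
    return sum(w * spins[i] * spins[j] for (i, j), w in couplings.items())

def best_spins(couplings, n):
    """Brute-force ground-state search over all 2^n assignments of +/-1.
    Quantum annealing replaces this enumeration with physical dynamics."""
    return min(product([-1, 1], repeat=n),
               key=lambda s: ising_energy(s, couplings))
```

With couplings h_ij derived from the state S_a, the minimizing spin vector encodes the selected action a.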
  • In the optimization calculation, the probability of finding the true optimal value can vary depending on the calculation time. For example, the optimal value found after performing the calculation for a nearly infinite time will not necessarily be the same as the optimal value found after performing the calculation for a short time.
  • FIG. 3 is an image diagram of the probability distribution of a solution obtained by an optimization calculation using quantum annealing.
  • Part (a) of FIG. 3 is the probability distribution of the solution output after performing the optimization calculation for a long time,
  • and part (b) is the probability distribution of the solution output after performing the optimization calculation for a short time.
  • In FIG. 3, the true optimal value is v_0.
  • The probability distribution is expressed, for example, by a normal distribution.
  • Figure 4 shows an image of the probability distribution of action a selected by performing optimization calculations, with the policy ⁇ represented as an optimization calculation model.
  • Figure 4 (a) shows the probability distribution of action a selected after performing optimization calculations based on the policy ⁇ for a long period of time, and (b) shows the probability distribution of action a selected after performing optimization calculations based on the policy ⁇ for a short period of time.
  • The probability of selecting an action a_1 other than the optimal action a_0 changes depending on the calculation time of the optimization calculation. For example, the longer the calculation time, the lower the probability of selecting an action a_1 other than the optimal action a_0; the shorter the calculation time, the higher that probability.
  • Changing the calculation time in quantum annealing can change the randomness of the optimization calculation.
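  • A classical simulated-annealing stand-in shows the same effect (assumed here purely for illustration; the linear cooling schedule and Metropolis acceptance rule are textbook choices, not the disclosure's): a shorter run leaves more randomness in the returned solution.

```python
import math
import random

def anneal(energy, n_vars, n_steps, seed=0):
    """Minimize energy() over +/-1 variables with a linear cooling
    schedule. Fewer steps means a higher temperature per move on
    average, hence a broader, more random distribution of results."""
    rng = random.Random(seed)
    spins = [rng.choice([-1, 1]) for _ in range(n_vars)]
    for step in range(n_steps):
        temp = max(1e-3, 1.0 - step / n_steps)  # cools from 1.0 toward 0
        i = rng.randrange(n_vars)
        before = energy(spins)
        spins[i] = -spins[i]                    # propose a single flip
        delta = energy(spins) - before
        if delta > 0 and rng.random() >= math.exp(-delta / temp):
            spins[i] = -spins[i]                # reject the uphill move
    return spins
```

Here `n_steps` plays the role of the calculation-time parameter: a long schedule concentrates the output on the true optimum, a short one does not.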
  • the optimal action may change when the real environment 2 changes. For example, even if the action a0 was optimal in the real environment 2 before the change, the action a0 ' may become optimal when the real environment 2 changes.
  • the randomness of the optimization calculation is changed based on the amount of change in the real environment.
  • the randomness of the optimization calculation can be changed by changing the calculation parameters.
  • In quantum annealing, the calculation time of the optimization calculation is one of the calculation parameters.
  • In a genetic algorithm, the probability of mutation occurring is one of the calculation parameters that contribute to the randomness of the optimization calculation.
  • The randomness of the optimization calculation can be changed by changing the mutation probability.
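  • For the genetic-algorithm case, the mutation probability plays this role directly; a minimal sketch (bit-string genomes and the helper name `mutate` are assumptions of the example):

```python
import random

def mutate(genome, p_mut, rng):
    """Flip each bit of the genome independently with probability
    p_mut. Raising p_mut increases the randomness of the search
    (more exploration); lowering it keeps the search close to the
    current best candidates."""
    return [1 - g if rng.random() < p_mut else g for g in genome]
```

Setting `p_mut` from the measured environmental change therefore tunes how far the search wanders from the previously optimal action.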
  • In adiabatic quantum computing, noise such as heat applied to the computing device is one of the calculation parameters that contribute to randomness.
  • FIG. 6 is an example of a flow diagram of reinforcement learning according to the first embodiment.
  • the behavior decision method according to the first embodiment has a behavior decision step S1, a state change step S2, and a learning step S3.
  • In the behavior decision step S1, an optimization calculation is performed based on a policy π represented by a model applicable to the optimization calculation, and an action a of the agent 1 is determined.
  • In the state change step S2, the action a is input to the real environment 2, whereby the real environment 2 and the agent 1 interact with each other, changing the state of the agent 1.
  • In the learning step S3, the value of the state S_a of the agent 1 after the action a is obtained as a reward r, and the next policy π is determined based on the reward r.
  • the action decision step S1 includes, for example, an environmental change measurement step S11, a calculation parameter setting step S12, and an optimization calculation step S13.
  • In the environmental change measurement step S11, the real environment 2 at the time of the action decision (action decision step S1) is measured, and the amount of change in each parameter relative to the real environment 2 at the time of learning (learning step S3) is obtained.
  • For example, the amounts of change in the temperature and humidity at which the robot arm operates, the wear condition of parts, the material of the floor on which the robot arm operates, and the movement of an adjacent robot arm are measured.
  • the measurement of the real environment 2 is performed, for example, by a sensor, etc.
  • Each parameter to be measured is set in advance.
  • The amount of environmental change is found, for example, by adding up the amounts of change in the measured parameters.
  • In the summation, the parameter with the highest importance may be multiplied by a weighting coefficient.
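  • The weighted sum described above can be written as follows (the parameter names and weights are illustrative, not from the disclosure):

```python
def environment_change(before, after, weights=None):
    """Aggregate environmental change: sum over the measured
    parameters of |after - before|, with an optional importance
    coefficient per parameter (defaulting to 1.0)."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * abs(after[name] - before[name])
               for name in before)
```

`before` holds the measurements from the learning step S3 and `after` those from the action decision step S1.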
  • In the calculation parameter setting step S12, the calculation parameters are set based on the amount of environmental change found in the environmental change measurement step S11.
  • the calculation parameters differ depending on the method of optimization calculation. For example, when the optimization calculation is performed using quantum annealing, the calculation time of quantum annealing is one of the calculation parameters. In another example, when the optimization calculation is performed using a genetic algorithm, the probability of mutation is one of the calculation parameters. In another example, when the optimization calculation is performed using adiabatic quantum computing, noise such as heat added to the calculation is one of the calculation parameters.
  • For example, the amount of environmental change may be input to a sigmoid function and converted into a numerical value (calculation parameter) between 0 and 1.
  • This type of conversion makes it possible to reduce the amount of change in the calculation parameter when the amount of environmental change is small, and to increase the amount of change in the calculation parameter when the amount of environmental change is large.
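  • One way to realize this conversion is the following (the gain and midpoint are tuning assumptions, not values from the disclosure):

```python
import math

def change_to_parameter(delta, gain=1.0, midpoint=0.0):
    """Squash the environmental change delta into (0, 1) with a
    sigmoid. Choosing a positive midpoint keeps the parameter small
    for small changes and lets it saturate toward 1 for large ones."""
    return 1.0 / (1.0 + math.exp(-gain * (delta - midpoint)))
```

The resulting value in (0, 1) can then be scaled onto whichever calculation parameter the chosen optimization method exposes.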
  • When the optimization calculation is performed using adiabatic quantum computing, for example, the greater the amount of environmental change, the more noise sources are added.
  • In the optimization calculation step S13, the optimization calculation is performed according to the calculation parameters determined in the calculation parameter setting step S12, and an action a is determined. As shown in FIG. 4, the action a selected in the optimization calculation step S13 is not limited to the truly optimal action a_0.
  • That is, in the optimization calculation step S13, it is possible to select an action a_1 other than the action a_0 that is truly optimal in a given real environment 2.
  • The smaller the randomness set by the calculation parameters, the lower the probability of selecting an action a_1 other than the truly optimal action a_0; the greater the randomness, the higher that probability.
  • the state change step S2 includes, for example, a behavior step S21 and an operation step S22.
  • In the behavior step S21, the action a determined according to the policy π is input to the agent 1 in the real environment 2.
  • For example, the action a is input to the robot arm by applying a current to each of the conductors that control the movement of the robot arm.
  • The current applied to each conductor is determined based on the optimization calculation.
  • The policy π is represented by an optimization calculation model.
  • In the operation step S22, the agent 1 acts in accordance with the action a determined in the behavior step S21.
  • For example, a current is applied to each of the conductors that control the movement of the robot arm, causing the robot arm to act.
  • the action of the agent 1 changes the state S a of the agent 1.
  • the action of the robot arm changes the state of the robot arm (e.g., the angle of the joint).
  • the interaction between the real environment 2 and the agent 1 changes the state S a of the agent 1.
  • The learning step S3 includes, for example, a real environment measurement step S31, a reward calculation step S32, and a policy determination step S33.
  • In the real environment measurement step S31, the real environment 2 at the time of the learning step S3 is measured.
  • The real environment 2 at the time of the learning step S3 does not necessarily match the real environment 2 at the time of the action decision step S1 described above.
  • By measuring the real environment 2 in this step, the amount of change in the real environment 2 between the learning step S3 and the action decision step S1 can be obtained, and the calculation parameters can be set.
  • The real environment measurement step S31 may be performed after the reward calculation step S32.
  • In the reward calculation step S32, a reward r for the state S_a of the agent 1 after the action a is obtained.
  • In the policy determination step S33, the next policy π is decided based on the reward r.
  • That is, the result of the previous action is reflected, and the model of the optimization calculation representing the policy π is remade.
  • The policy π represents the relationship between the actions and how each affects the state S_a of the agent 1. If the reward r is high, the previous policy π can be said to have been an appropriate model, and a similar model is created. If the reward r is low, the relationship represented by the previous policy π can be said to have been inappropriate, and a new model that takes the previous action into account is created.
  • The reinforcement learning according to the first embodiment repeats the action decision step S1, the state change step S2, and the learning step S3, learning so as to maximize the reward.
  • The environmental change measurement step S11, the calculation parameter setting step S12, and the real environment measurement step S31 do not have to be performed every time these steps are repeated.
  • For example, they may be performed once every several repetitions, or they may be performed at random.
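  • The overall repetition of steps S1 to S3, with the measurement sub-steps run only periodically, can be skeletonized as follows (all callables and the `measure_every` schedule are placeholders):

```python
def train(decide_action, apply_action, learn, n_iters, measure_every=5):
    """Repeat action decision (S1), state change (S2), and learning (S3).
    The environment measurement and parameter setting sub-steps
    (S11, S12, S31) are refreshed only every `measure_every` iterations,
    as the text above permits."""
    rewards = []
    for it in range(n_iters):
        measure = (it % measure_every == 0)   # S11/S12/S31 run sparsely
        action = decide_action(measure)       # S1: optimization calculation
        state = apply_action(action)          # S2: interact with environment
        rewards.append(learn(state))          # S3: reward and next policy
    return rewards
```

A random schedule could replace the modulo test without changing the structure of the loop.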
  • The behavior decision-making method according to the first embodiment can also adapt to changes in the real environment 2.
  • When the policy π is expressed as a fixed probability distribution and actions are determined probabilistically according to it, the policy does not change even if the environment changes, and neither does the probability with which each action is selected. Therefore, once reinforcement learning has been performed in a given real environment 2 and a policy π has been fixed, the way actions are determined does not change even if the environment changes.
  • Such a policy π can select the optimal action a before the real environment 2 changes, but there is no guarantee that this action a remains optimal after the real environment 2 changes.
  • In this case, the action a is selected probabilistically according to the probability distribution learned in the real environment 2 before the change.
  • The probability that a_0 or a_0' will be selected does not change from the time the policy was decided.
  • As a result, the agent 1 is more likely to select an action a that is inappropriate in the real environment 2 after the change.
  • In contrast, in the behavior decision-making method according to the first embodiment, the policy π is represented by a model applicable to the optimization calculation.
  • This model is set in the real environment 2 during learning, and the same policy π can be used even if the real environment 2 changes.
  • The behavior decision based on the policy π is performed by the optimization calculation.
  • The probability distribution of the behavior a obtained as a result of the optimization calculation changes depending on the calculation parameters.
  • The real environment 2 at the time of deciding an action does not necessarily match the real environment 2 at the time the policy π was obtained by learning. Nevertheless, the action is decided based on the policy π learned in the earlier real environment 2.
  • In the action decision method according to this embodiment, changing the randomness of the action decision increases the probability of selecting an action other than the action a_0 that was optimal in the pre-change real environment 2, making it possible to respond to changes in the real environment 2. When this method is used, it therefore becomes easier to select an action adapted to the new real environment 2, and learning can proceed while making use of the progress made so far.
  • FIG. 7 is a block diagram of the behavior decision device 10 according to the first embodiment.
  • the behavior decision device 10 includes, for example, a learning unit 11, a first control unit 12, a second control unit 13, a memory 14, a measurement unit 15, and a transmission/reception unit 16.
  • the learning unit 11 acquires the state S a of the agent 1 after the action a as a reward r, and determines the next policy ⁇ based on the reward r.
  • the learning unit 11 performs learning so as to maximize the reward r.
  • the learning unit 11 acquires, for example, the action a and the state S a of the agent 1 as a result of the action a from the memory 14.
  • the learning unit 11 obtains the reward r based on the acquired action a and state S a .
  • the learning unit 11 determines the policy ⁇ based on the reward r.
  • the learning unit 11 has, for example, a computing unit (CPU).
  • the learning unit 11 performs, for example, learning step S3.
  • The first control unit 12 executes an optimization calculation based on a policy represented by a model applicable to the optimization calculation, and determines the behavior of the agent.
  • The first control unit 12 performs, for example, the behavior determination step S1.
  • The policy π determined by the learning unit 11 is transmitted to the first control unit 12.
  • The first control unit 12 sets the calculation parameters based on the amount of change in the real environment 2 measured by the measurement unit 15.
  • The first control unit 12 then executes the optimization calculation with the set calculation parameters and determines the action a of the agent 1.
  • the first control unit 12 has, for example, a computing unit (CPU).
  • the second control unit 13 inputs the action a into the real environment 2, and changes the state S a of the agent 1.
  • the second control unit 13 instructs the agent 1 to perform the action a.
  • the second control unit 13 instructs the agent 1 via, for example, the transmitting/receiving unit 16.
  • the second control unit 13 has, for example, a computing unit (CPU).
  • the memory 14 stores learning data, a program according to the behavior decision method, and information on environmental changes.
  • the learning data includes, for example, an agent's behavior a, a state S a of the agent 1 as a result of the agent's behavior a, and a reward r for the state S a .
  • the measurement unit 15 is, for example, a sensor.
  • the measurement unit 15 measures the real environment 2.
  • the measurement unit 15 is used, for example, when performing the real environment measurement step S31 and the environmental change amount measurement step S11.
  • the measurement unit 15 measures, for example, each parameter representing the real environment 2.
  • The transmitting/receiving unit 16 transmits the action a to the device, for example, according to an instruction from the second control unit 13. For example, the transmitting/receiving unit 16 transmits the action a based on the policy π to the control unit of the robot arm. The transmitting/receiving unit 16 also receives the state S_a of the agent 1 as a result of the action a.
  • The connection of the transmitting/receiving unit 16 may be wired or wireless.
  • The transmitting/receiving unit 16 is responsible for the state change step S2.
  • the reward r obtained in the reward calculation step S32 is stored in the memory 14, for example, via the transmitting/receiving unit 16.
  • the behavior decision-making device 10 operates according to the behavior decision-making method described above, and is therefore also capable of responding to changes in the real environment 2.


Abstract

This action determination method includes an action determination step, a state change step, and a learning step. In the action determination step, an optimization calculation is executed on the basis of a policy represented by a model applicable to the optimization calculation, and the action of an agent is determined. In the state change step, the action is input into an actual environment, whereby the actual environment and the agent interact, and the state of the agent changes. In the learning step, a value for the state of the agent after the action is acquired as a reward, and the next policy is determined on the basis of the reward. The optimization calculation can modify randomness by using a calculation parameter. The calculation parameter is modified on the basis of a change amount of the actual environment between the learning step and the action determination step.

Description

行動決定方法及び行動決定装置Behavior determination method and behavior determination device

 本発明は、行動決定方法及び行動決定装置に関する。 The present invention relates to a behavior decision-making method and a behavior decision-making device.

 機械学習は、教師あり学習と、教師なし学習と、強化学習とに、大別できる。強化学習は、自動運転、動作の最適化等の学習に適していると言われており、注目されている。 Machine learning can be broadly divided into supervised learning, unsupervised learning, and reinforcement learning. Reinforcement learning is said to be suitable for learning autonomous driving and behavior optimization, and is attracting attention.

 強化学習は、エージェントと環境とが相互作用を繰り返し、試行錯誤を行うことで、タスクを実行する学習方法である。相互作用は、エージェントと環境とが互いに情報を送受信し合うことをいう。 Reinforcement learning is a learning method in which an agent executes a task by repeatedly interacting with the environment and through trial and error. Interaction refers to the sending and receiving of information between the agent and the environment.

 強化学習において、エージェントと環境とは、状態、行動、報酬の情報を送受信し合う。状態は、エージェントが置かれている状況であり、行動は、エージェントの振る舞いである。報酬は、行動後のエージェントの状態に対する評価指標である。エージェントは、方策に基づいた行動を環境に入力する。環境は、エージェントの行動に応じて、状態と報酬をエージェントに入力する。エージェントと環境とは、報酬を最大限得られるように、相互作用を繰り返し試行錯誤する。 In reinforcement learning, the agent and the environment send and receive information on state, action, and reward. The state is the situation the agent is in, and the action is the agent's behavior. The reward is an evaluation index for the agent's state after the action. The agent inputs actions based on the policy to the environment. The environment inputs the state and reward to the agent according to the agent's actions. The agent and environment repeatedly interact through trial and error to maximize the reward.

 例えば、特許文献1には、報酬関数に最適化計算を適用した強化学習方法が開示されている。 For example, Patent Document 1 discloses a reinforcement learning method that applies optimization calculations to the reward function.

特許第7111178号公報Patent No. 7111178

 特許文献1に記載の強化学習方法では、分配関数ZRを分母とした確率分布関数(π(a|s)=exp(r(s))/Z)で方策を表現している。エージェントは方策に従って行動を決定する。方策を確率分布で規定すると、行動決定のランダム性が一意に決まってしまう。実環境が変化しない場合は、学習された方策に従って行動を決定しても問題ないが、実環境が変化する場合は、変化前の実環境で学習された方策を基に行動が決定されてしまう。この場合、変化後の実環境での行動決定を、変化前の実環境で学習された方策に基づいて行うことになり、行動決定の妥当性が低くなる場合がある。そのため、特許文献1に記載の強化学習方法は、実環境が変化する毎に学習をし直す必要があり、実環境の変化に十分対応することができない。 In the reinforcement learning method described in Patent Document 1, the policy is expressed by a probability distribution function (π(a|s)=exp(r a (s))/Z R ) with the distribution function Z R as the denominator. The agent decides on an action according to the policy. If the policy is specified by a probability distribution, the randomness of the action decision is uniquely determined. When the real environment does not change, there is no problem in deciding the action according to the learned policy, but when the real environment changes, the action is decided based on the policy learned in the real environment before the change. In this case, the action decision in the real environment after the change is made based on the policy learned in the real environment before the change, and the validity of the action decision may be reduced. Therefore, the reinforcement learning method described in Patent Document 1 needs to re-learn every time the real environment changes, and cannot adequately respond to changes in the real environment.

 The present invention has been made in view of the above circumstances. Its object is to provide a behavior decision method and a behavior decision device that, by changing the randomness of action decisions in response to changes in the real environment, raise the probability of selecting actions that are not truly optimal at learning time and thereby raise the possibility of selecting an action suited to a changed environment.

 To solve the above problems, the present invention provides the following means.

 A behavior decision method according to a first aspect includes a behavior decision step, a state change step, and a learning step. In the behavior decision step, an optimization calculation is executed based on a policy expressed by a model applicable to optimization calculation, and an action of an agent is decided. In the state change step, the action is input into a real environment, the real environment and the agent interact, and the state of the agent changes. In the learning step, the value of the agent's state after the action is obtained as a reward, and the next policy is decided based on the reward. The randomness of the optimization calculation can be changed by a calculation parameter, and the calculation parameter is changed based on the amount of change in the real environment between the learning step and the behavior decision step.

 A behavior decision device according to a second aspect includes a learning unit, a first control unit, and a second control unit. The first control unit executes an optimization calculation based on a policy expressed by a model applicable to optimization calculation, and decides an action of an agent. The second control unit inputs the action into a real environment and changes the state of the agent. The learning unit obtains the state of the agent after the action as a reward, and decides the next policy based on the reward. The randomness of the optimization calculation can be changed by a calculation parameter, and the calculation parameter is changed based on the amount of change in the real environment between when the agent's action is decided and when the policy is decided.

 The behavior decision method and behavior decision device according to the present invention can adapt to environmental changes by changing the randomness of action decisions in response to changes in the environment.

FIG. 1 is a conceptual diagram of the reinforcement learning underlying the behavior decision method according to the first embodiment.
FIG. 2 is a conceptual diagram showing state changes in the reinforcement learning underlying the behavior decision method according to the first embodiment.
FIG. 3 is an illustration of the probability distribution of solutions obtained by an optimization calculation using quantum annealing.
FIG. 4 is an illustration of the probability distribution of actions selected by expressing the policy as an optimization calculation model and performing the optimization calculation.
FIG. 5 is an illustration showing the relationship between the actions an agent selects based on a policy and changes in the real environment.
FIG. 6 is an example of a flow diagram of the reinforcement learning according to the first embodiment.
FIG. 7 is a block diagram of the behavior decision device according to the first embodiment.

 The present embodiment will be described in detail below with reference to the drawings as appropriate. For ease of understanding, the drawings used in the following description may show characteristic parts enlarged, and the dimensional ratios of components may differ from the actual ones. The materials, dimensions, and the like exemplified below are merely examples; the present invention is not limited to them and can be modified as appropriate without changing its gist.

 FIG. 1 is a conceptual diagram of the reinforcement learning underlying the behavior decision method according to the first embodiment. In reinforcement learning, an agent 1 and a real environment 2 learn by interacting with each other, and a reward r is obtained according to the state S_a of the agent 1 after an action a. The action a of the agent 1 is decided according to a policy π, and the agent 1 performs the decided action. The action a can be represented, for example, by an action vector (a_1, a_2, a_3, ..., a_n). The state S_a of the agent 1 changes according to the action a and can be represented, for example, by a state vector (S_a0, S_a1, S_a2, ..., S_an).

 A concrete explanation follows, taking the control of a robot arm as an example. In this case, the agent 1 is the control unit of the robot arm, and the real environment 2 is the environment in which the robot arm operates: for example, the temperature and humidity at which the arm operates, the wear state of its parts, the material of the floor on which it operates, and the movement of adjacent robot arms.

 For example, if the robot arm operates by passing currents through n conductors connected to it, the action a of the agent 1 corresponds to the operation of passing those currents. Specifying how many amperes to pass through each of the n conductors determines one action a of the robot arm.

 Each element of the action vector (a_1, a_2, a_3, ..., a_n) corresponds to the choice of how many amperes to pass through the l-th conductor (l = an integer from 1 to n). For example, a_1 is the choice for the first conductor and a_2 the choice for the second. If there are K_l options for the current through the l-th conductor, then one of the currents {I_l^1, I_l^2, ..., I_l^{K_l}} is applied to that conductor.
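 This choice structure can be encoded for an optimization solver. A minimal sketch, assuming the one-hot representation mentioned later in this description, in which each candidate current of each conductor gets one binary variable (function names are illustrative, not from the publication):

```python
def one_hot_action(choices, num_options):
    """choices[l] = index of the current chosen for conductor l (0-based).
    num_options[l] = K_l, the number of candidate currents for conductor l.
    Returns a flat binary vector of length N = K_1 + ... + K_n."""
    bits = []
    for chosen, num in zip(choices, num_options):
        vec = [0] * num
        vec[chosen] = 1  # exactly one bit set per conductor
        bits.extend(vec)
    return bits

# Example: two conductors with 3 and 2 candidate currents;
# choose the 2nd current for conductor 1 and the 1st for conductor 2.
encoded = one_hot_action([1, 0], [3, 2])
```

One action then corresponds to one valid assignment of the N binary variables, with exactly one bit set per conductor.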

 The policy π that decides the agent 1's action is, for example, a guideline for controlling the robot arm. Based on this guideline (the policy π), the control of the robot arm (the action a) is decided.

 When the agent 1 performs an action a, its state S_a changes. The state S_a is, for example, the set of joint angles of the robot arm: applying a current to each of the n conductors connected to the arm changes the joint angles, and thus the state S_a.

 The reward r is, for example, the value obtained by changing the state S_a of the robot arm. If correct control means bending the arm by 30 degrees, then the closer the state S_a produced by the action a is to a 30-degree bend, the higher the reward r.

 FIG. 2 is a conceptual diagram showing state changes in the reinforcement learning underlying the behavior decision method according to the first embodiment. In reinforcement learning, the state of the agent 1 transitions through learning.

 Each of S_a0, S_a1, S_a2, S_a3, and S_a4 in FIG. 2 represents a state of the agent 1. Starting from the initial state S_a0, an action is selected by applying an optimization calculation to the policy π, and the agent transitions according to the result to, for example, one of S_a0, S_a1, S_a2, and S_a3. In the robot arm example, the arm transitions from the initial state S_a0 with no joints bent to one of the next states S_a0, S_a1, S_a2, and S_a3: the next state may remain the unbent state S_a0, or be one of the bent states S_a1, S_a2, and S_a3. Which state is reached is decided based on the policy π. If the arm transitions to the bent state S_a2, the next action takes it to one of the states S_a0, S_a3, and S_a4. In reinforcement learning, a reward r may be obtained, for example, for each action a that produces a state transition, or after actions have been selected and executed multiple times under the same policy (per episode).

 In the present embodiment, the policy π is expressed by a model to which an optimization calculation can be applied, and an action a is decided based on the policy π by performing the optimization calculation. The randomness of the optimization calculation can be changed by a calculation parameter. As described in detail later, deciding the action a via an optimization calculation whose randomness is tunable raises the possibility of selecting an action suited to the environment even when the real environment 2 has changed (fluctuated) since the learning that decided the policy π.

 The optimization calculation may be performed, for example, by adiabatic quantum computation on a quantum computer, by quantum annealing, or by a genetic algorithm.

 For example, when the optimization calculation is performed by quantum annealing, an Ising model or a QUBO can be used as the model applicable to the optimization calculation. When it is performed by adiabatic quantum computation, an arbitrary Hermitian matrix can be used as such a model; when it is performed by a genetic algorithm, an arbitrary real-valued function can be used.

 For example, when the policy π is expressed by an Ising model, it is represented by the following formula (1).

π(a|S_a) = Σ_{i,j=1}^{N} h_ij σ_i σ_j   ... (1)

 Here, π(a|S_a) is the optimization calculation model and represents the relationships among the elements of the action vector a. σ_i and σ_j are input variables, each taking one of the two values +1 or -1. h_ij is an interaction parameter, expressed, for example, as a function of the state S_a of the agent 1. When the options are expressed in one-hot representation, N = K_1 + K_2 + ... + K_n, where K_l is the number of options for the amount of current that can be applied to the l-th conductor (l = a natural number from 1 to n).
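 The Ising form above can be sketched in code. A minimal illustration of evaluating the model's energy, assuming the couplings h_ij are supplied as a plain dict (in the method they would depend on the state S_a; that dependence is omitted here for brevity):

```python
def ising_energy(sigma, h):
    """Energy of a spin configuration sigma (entries +1/-1) under couplings
    h = {(i, j): h_ij}. Minimizing this energy over configurations is what
    the optimization calculation (e.g. annealing) does to select an action."""
    return sum(h_ij * sigma[i] * sigma[j] for (i, j), h_ij in h.items())

# Two-spin illustration: a negative coupling favours aligned spins, so the
# aligned configuration has lower energy and would be the preferred solution.
h = {(0, 1): -1.0}
e_aligned = ising_energy([+1, +1], h)
e_anti = ising_energy([+1, -1], h)
```

The annealer searches for the σ configuration minimizing this energy; the decoded configuration is the action a.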

 In an optimization calculation using quantum annealing, the probability of finding the true optimum can depend on the calculation time. For example, the optimal value obtained after computing for a nearly infinite time does not necessarily match the one obtained after a short computation.

 FIG. 3 is an illustration of the probability distribution of solutions obtained by an optimization calculation using quantum annealing. Part (a) of FIG. 3 shows the distribution of solutions output after a long optimization calculation, and part (b) after a short one. The true optimal value is v_0, and the distribution is expressed, for example, as a normal distribution.

 When an optimization calculation using quantum annealing runs for a long time, the probability of selecting the true optimal value v_0 increases, and the probability of outputting a value v_1 other than v_0 as the optimum decreases. Conversely, when the calculation time is short, the probability of selecting v_0 decreases and the probability of outputting some other value v_1 increases, because a short calculation time in quantum annealing is generally treated as noise. A short calculation may therefore output a value v_1 other than the true optimum v_0. In a general optimization calculation this is a problem, but the behavior decision method according to the present embodiment exploits this property.

 FIG. 4 is an illustration of the probability distribution of the action a selected by expressing the policy π as an optimization calculation model and performing the optimization calculation. Part (a) of FIG. 4 shows the distribution of actions selected after a long optimization calculation based on the policy π, and part (b) after a short one.

 When the policy π is expressed as an optimization calculation model and the action a is obtained by optimization, the probability of selecting an action a_1 other than the optimal action a_0 changes with the calculation time: the longer the calculation, the lower that probability; the shorter the calculation, the higher it becomes. Changing the calculation time of quantum annealing thus changes the randomness of the optimization calculation.
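 This time/randomness trade-off can be illustrated with a classical stand-in. A sketch, assuming simulated annealing as a software analogue of quantum annealing (the actual method would run on annealing hardware): with few steps the run often ends on a non-optimal state, while with many steps it concentrates on the optimum.

```python
import math
import random

def anneal(energies, n_steps, rng):
    """Toy annealing over states 0..len(energies)-1 with a linear cooling
    schedule; returns the index of the final state."""
    state = rng.randrange(len(energies))
    for t in range(1, n_steps + 1):
        cand = rng.randrange(len(energies))
        d_e = energies[cand] - energies[state]
        temp = max(1e-9, 1.0 - t / n_steps)  # cools to ~0 at the last step
        # Accept better moves always; worse moves with Boltzmann probability.
        if d_e <= 0 or rng.random() < math.exp(-d_e / temp):
            state = cand
    return state

def hit_rate(n_steps, trials=500, seed=0):
    """Fraction of runs ending on the true optimum (state 0)."""
    rng = random.Random(seed)
    energies = [0.0, 1.0, 1.0, 1.0]  # state 0 is the true optimum
    return sum(anneal(energies, n_steps, rng) == 0
               for _ in range(trials)) / trials
```

Comparing `hit_rate(60)` with `hit_rate(2)` shows the longer schedule finding the optimum far more often, mirroring FIG. 4(a) versus 4(b).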

 FIG. 5 is an illustration showing the relationship between the actions an agent selects based on a policy and changes in the real environment. When the real environment 2 changes, the optimal action may change as well: even if the action a_0 was optimal in the real environment 2 before the change, an action a_0' may become optimal after it.

 When the randomness of the optimization calculation is small (for example, when the calculation time is long), the probability of selecting an action other than the action a_0 considered optimal in a given real environment 2 is low. In that case, even after the real environment 2 changes, the agent is likely to keep selecting the action a_0 that was optimal before the change. In other words, the probability of selecting the action a_0' that is optimal in the changed environment is low, and the agent may never take it.

 Conversely, when the randomness of the optimization calculation is large (for example, when the calculation time is short), the probability of selecting an action other than a_0 is high. In that case, even when the real environment 2 changes, there is room to select actions other than the action a_0 that was optimal before the change. A high probability of selecting actions other than a_0 in the pre-change environment is equivalent to a high probability of selecting the action a_0' that is optimal in the post-change environment.

 In the behavior decision method according to the present embodiment, the randomness of the optimization calculation is varied based on the amount of change in the real environment, by changing a calculation parameter. As described above, when the optimization calculation is quantum annealing, the calculation time is one such calculation parameter.

 When the optimization calculation is performed by a genetic algorithm, the mutation probability is one of the calculation parameters that contribute to its randomness, and changing it changes the randomness of the calculation. When the optimization is performed by adiabatic quantum computation on a quantum computer, noise such as heat applied to the computing device is one of the parameters that contribute to randomness.
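 As an illustration of the mutation probability acting as the randomness knob, a minimal bit-flip mutation over a binary action encoding (names and the specific operator are illustrative assumptions, not taken from the publication):

```python
import random

def mutate(bits, p_mut, rng):
    """Flip each bit independently with probability p_mut."""
    return [1 - b if rng.random() < p_mut else b for b in bits]

rng = random.Random(42)
parent = [0] * 100
few_flips = sum(mutate(parent, 0.01, rng))   # small randomness: few flips
many_flips = sum(mutate(parent, 0.5, rng))   # large randomness: many flips
```

Raising `p_mut` widens the search around the current best action, which is exactly the effect wanted when the real environment has drifted.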

 FIG. 6 is an example of a flow diagram of the reinforcement learning according to the first embodiment. The behavior decision method according to the first embodiment includes a behavior decision step S1, a state change step S2, and a learning step S3.

 In the behavior decision step S1, an optimization calculation is executed based on the policy π expressed by a model applicable to optimization calculation, and the action a of the agent 1 is decided. In the state change step S2, the action a is input into the real environment 2, the real environment 2 and the agent 1 interact, and the state of the agent 1 changes. In the learning step S3, the value of the state S_a of the agent 1 after the action a is obtained as a reward r, and the next policy π is decided based on the reward r.

 The behavior decision step S1 includes, for example, an environment change measurement step S11, a calculation parameter setting step S12, and an optimization calculation step S13.

 In the environment change measurement step S11, the real environment 2 at the time of action decision (the behavior decision step S1) is measured, and the amount of change from each parameter of the real environment 2 at the time of learning (the learning step S3) is obtained. For example, the temperature and humidity at which the robot arm operates, the wear state of its parts, the material of the floor, and the movement of adjacent robot arms are measured, for instance with sensors. The parameters to be measured are set in advance. The environment change amount is obtained, for example, by summing the changes of the measured parameters; a coefficient may be applied to parameters of higher importance.
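 The change-amount computation described here can be sketched as follows; the parameter names, readings, and weighting scheme are illustrative assumptions:

```python
def environment_change(learned, current, weights=None):
    """learned / current: dicts of sensor readings at learning time and at
    action-decision time. weights: optional importance coefficients
    (default 1.0). Returns a single scalar change amount."""
    weights = weights or {}
    return sum(weights.get(k, 1.0) * abs(current[k] - learned[k])
               for k in learned)

# Hypothetical readings: +2 degrees of temperature, +0.05 of joint wear,
# with wear weighted 10x because it matters more for the robot arm.
learned = {"temperature": 25.0, "humidity": 40.0, "wear": 0.10}
current = {"temperature": 27.0, "humidity": 40.0, "wear": 0.15}
delta = environment_change(learned, current, weights={"wear": 10.0})
```

The resulting scalar `delta` is what step S12 converts into a calculation parameter.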

 In the calculation parameter setting step S12, a calculation parameter is set based on the environment change amount obtained in the environment change measurement step S11. The calculation parameter depends on the optimization method: when the optimization calculation is quantum annealing, the annealing calculation time is one such parameter; when it is a genetic algorithm, the mutation probability; when it is adiabatic quantum computation, noise such as heat added to the computation.

 For example, the environment change amount may be input to a sigmoid function and converted into a value between 0 and 1 (the calculation parameter). With such a conversion, a small environment change produces a small change in the calculation parameter, and a large environment change produces a large one.
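 A minimal sketch of the sigmoid conversion; the gain and offset values, and the suggested mapping onto annealing time, are illustrative assumptions:

```python
import math

def change_to_parameter(delta, gain=1.0, offset=0.0):
    """Squash an environment change amount into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(gain * delta - offset)))

# Larger change -> larger parameter. The parameter could then, for example,
# shorten the annealing time as t = t_max * (1 - parameter); that mapping
# is an assumption, not specified in the text.
p_small = change_to_parameter(0.0)
p_large = change_to_parameter(5.0)
```

Choosing `gain` and `offset` sets how sensitive the randomness is to environmental drift.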

 For example, when the optimization calculation is quantum annealing, the larger the environment change, the shorter the annealing calculation time is made. When it is a genetic algorithm, the larger the environment change, the higher the mutation probability is set. When it is adiabatic quantum computation, the larger the environment change, the more noise sources are added.

 In the optimization calculation step S13, the optimization calculation is performed according to the calculation parameter decided in the calculation parameter setting step S12, and the action a is decided. As shown in FIG. 4, the action a selected in step S13 is not limited to the truly optimal action a_0.

 In the optimization calculation step S13, an action a_1 other than the truly optimal action a_0 in the given real environment 2 may also be selected. The smaller the randomness of the calculation parameter, the lower the probability of selecting such an action a_1; the larger the randomness, the higher it becomes. Because step S13 can select actions a_1 other than the truly optimal a_0, the method can also adapt to changes in the real environment 2.

 The state change step S2 includes, for example, an action step S21 and an operation step S22.

 In the action step S21, the action a decided according to the policy π is input to the agent 1 in the real environment 2. For example, the action a is input to the robot arm by applying a current to each of the conductors that control its movement; each current is decided based on the optimization calculation, the policy π being expressed as an optimization calculation model.

 In the operation step S22, the agent 1 operates according to the action a decided in the action step S21. For example, applying the currents to the conductors that control the robot arm causes the arm to move. The operation changes the state S_a of the agent 1: the robot arm's state (for example, its joint angles) changes. The state S_a of the agent 1 changes through the interaction between the real environment 2 and the agent 1.

 The learning step S3 includes, for example, a real environment measurement step S31, a reward calculation step S32, and a policy decision step S33.

 In the real environment measurement step S31, the real environment 2 during the learning step S3 is measured. This environment does not necessarily match the real environment 2 during the behavior decision step S1 described above. Measuring the real environment 2 at the time the policy is decided makes it possible to obtain the amount of change in the real environment 2 between the learning step S3 and the behavior decision step S1, and thus to set the calculation parameter. The real environment measurement step S31 may also be performed after the reward calculation step S32.

 In the reward calculation step S32, a reward r for the state S_a of the agent 1 after the action a is obtained. The higher the reward r, the more appropriate the action a of the agent 1 was.

 In the policy decision step S33, the next policy is decided based on the reward r. For example, the optimization calculation model representing the policy π is rebuilt to reflect the result of the previous action. The policy π represents the relationships among actions (how each affects the state S_a of the agent 1). When the reward r is high, the previous policy π was an appropriate model and a similar model is created; when the reward r is low, the relationships it represented were not appropriate, and a new model is created that takes the previous action into account.
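 The publication does not give a concrete update rule. As a highly simplified, hypothetical sketch, a reward-weighted adjustment of the Ising couplings could look like this: the rule lowers the energy of the last spin configuration when the reward was high, making similar actions more likely under energy minimization.

```python
def update_couplings(h, sigma, reward, lr=0.1):
    """h: {(i, j): coupling}; sigma: last spin configuration (+1/-1 entries);
    reward: scalar evaluation of the resulting state; lr: learning rate.
    Hypothetical rule, not taken from the publication."""
    return {(i, j): h_ij - lr * reward * sigma[i] * sigma[j]
            for (i, j), h_ij in h.items()}

h = {(0, 1): 0.0}
# A positive reward for the configuration (+1, +1) lowers its energy
# (the coupling moves from 0.0 to -0.1), so energy minimization now
# prefers the action that just earned the reward.
h_next = update_couplings(h, [+1, +1], reward=1.0)
```

A zero reward leaves the model unchanged; a negative reward would raise the energy of the last configuration, pushing the search away from it.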

 The reinforcement learning according to the first embodiment repeats the behavior decision step S1, the state change step S2, and the learning step S3, learning so as to maximize value (maximize the reward). The environment change measurement step S11, the calculation parameter setting step S12, and the real environment measurement step S31 need not be performed on every repetition; for example, they may be performed once every several repetitions, or at random.

 The action determination method according to the first embodiment can also adapt to changes in the real environment 2.

 For example, suppose the policy π is expressed as a fixed probability distribution and actions are chosen probabilistically according to it. Then the policy does not change when the environment changes, and neither do the probabilities with which actions are chosen. Once reinforcement learning has been performed in a given real environment 2 and the policy π has been fixed, the way actions are determined does not change even if the environment changes. Such a policy π can select the optimal action a before the real environment 2 changes, but there is no guarantee that the same action a remains optimal after the real environment 2 has changed.

 Such a policy selects an action a probabilistically according to the probability distribution learned in the real environment 2 before the change, even if the real environment 2 has changed between learning and action determination. In other words, even if the change in the real environment 2 alters the truly optimal action (a0 → a0'), the probabilities of selecting a0 and a0' remain as they were when the policy was decided. As a result, the agent 1 is more likely to select an action a that is no longer appropriate in the changed real environment 2.

 For example, if the wear at the joints of a robot arm worsens, the arm no longer moves smoothly. Even if current is applied to each conductor of the robot arm according to the policy obtained in the real environment 2 before the wear, the robot arm may fail to reach the desired state in the real environment after the wear.

 In contrast, in the action determination method according to the present embodiment, the policy π is represented by a model applicable to an optimization calculation. This model is set up in the real environment 2 at learning time, and the same policy π can still be used after the real environment 2 changes. Action determination based on the policy π is performed by the optimization calculation, and the probability distribution of the action a obtained as its result changes with the calculation parameters. When the real environment 2 has not changed, the probability of selecting the truly optimal action a0 under the policy π can be kept high; when the real environment 2 has changed, the probability of selecting a0 under the policy π can be lowered and the probability of selecting other actions a1 can be raised.
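To make concrete how a calculation parameter changes the probability distribution of the action a, the sketch below uses a Boltzmann distribution over illustrative "energies" of the policy model — the kind of distribution that annealing-style optimization samples from. The energy values and the two temperatures are assumptions chosen only for illustration, not values from the embodiment.

```python
import math

def action_distribution(energies, temperature):
    # Lower energy = better action under the model representing policy pi.
    weights = [math.exp(-e / temperature) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

energies = [0.0, 0.3, 0.6]   # index 0 plays the role of a0, the model optimum

# Small calculation parameter: the optimization almost always returns a0.
cold = action_distribution(energies, 0.05)

# Large calculation parameter: other actions gain substantial probability.
hot = action_distribution(energies, 5.0)
```

With a temperature of 0.05 the first action takes nearly all of the probability mass, while at 5.0 the three actions are selected almost uniformly — matching the text's description of raising or lowering the probability that a0 is chosen.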

 For example, if the wear at the joints of the robot arm worsens, current can be applied to each conductor of the robot arm under conditions other than those found optimal in the real environment 2 before the wear. In the action determination method according to the present embodiment, for instance, a larger amount of current than the pre-wear optimum may be selected for each conductor of the robot arm in accordance with the policy π. Indeed, in the worn state after the real environment 2 has changed, applying a larger amount of current than the pre-wear optimum to each conductor of the robot arm is the appropriate choice.

 The real environment 2 at the time of action determination does not necessarily match the real environment 2 in which the policy π was learned. Nevertheless, the action is determined based on the policy π learned in the earlier real environment 2. In the action determination method according to the present embodiment, changing the randomness of the action determination raises the probability of selecting actions other than the action a0 that was optimal in the real environment 2 before the change, so the method can respond to changes in the real environment 2. It therefore becomes easier to select actions suited to the new real environment 2, and learning can proceed while making use of what has already been learned.
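A minimal way to realize "changing the randomness of the action determination" from the measured environment change is a schedule that maps the change amount to the calculation parameter. The linear ramp, the bounds, and the function name below are assumptions for illustration; the embodiment does not specify this mapping.

```python
def set_calculation_parameter(env_change, t_min=0.05, t_max=5.0, scale=10.0):
    """Map the measured amount of change in the real environment 2 to a
    temperature-like calculation parameter: no change keeps the optimization
    nearly deterministic (exploit the learned policy pi), while a large
    change makes the action selection more random (explore around a0)."""
    return min(t_max, t_min + scale * abs(env_change))
```

`set_calculation_parameter(0.0)` returns the nearly deterministic lower bound, and the parameter grows with the measured change until it saturates, so actions other than the previously optimal a0 become progressively more likely to be selected.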

 FIG. 7 is a block diagram of the action determination device 10 according to the first embodiment. The action determination device 10 includes, for example, a learning unit 11, a first control unit 12, a second control unit 13, a memory 14, a measurement unit 15, and a transmitting/receiving unit 16.

 The learning unit 11 acquires the state Sa of the agent 1 after the action a as a reward r, and decides the next policy π based on the reward r. The learning unit 11 performs learning so as to maximize the reward r. For example, the learning unit 11 obtains the action a and the resulting state Sa of the agent 1 from the memory 14, computes the reward r from them, and decides the policy π based on the reward r.

 The learning unit 11 includes, for example, an arithmetic unit (CPU). The learning unit 11 performs, for example, the learning step S3.

 The first control unit 12 executes the optimization calculation based on a policy represented by a model applicable to the optimization calculation, and determines the action of the agent. The first control unit 12 performs, for example, the action determination step S1.

 The policy π decided by the learning unit 11 is passed to the first control unit 12. The first control unit 12 sets the calculation parameters based on the amount of change in the real environment 2 measured by the measurement unit 15, then executes the optimization calculation with the set calculation parameters and determines the action a of the agent 1. The first control unit 12 includes, for example, an arithmetic unit (CPU).

 The second control unit 13 inputs the action a into the real environment 2 and changes the state Sa of the agent 1. The second control unit 13 instructs the agent 1 to perform the action a, for example via the transmitting/receiving unit 16. The second control unit 13 includes, for example, an arithmetic unit (CPU).

 The memory 14 stores learning data, a program implementing the action determination method, and information on environmental changes. The learning data include, for example, the action a of the agent, the state Sa of the agent 1 resulting from the action a, and the reward r for the state Sa.

 The measurement unit 15 is, for example, a sensor. The measurement unit 15 measures the real environment 2 and is used, for example, in the real environment measurement step S31 and the environmental change measurement step S11. The measurement unit 15 measures, for example, the parameters representing the real environment 2.

 The transmitting/receiving unit 16 conveys the action a to the device, for example according to instructions from the second control unit 13. For example, the transmitting/receiving unit 16 transmits the action a based on the policy π to the controller of the robot arm, and receives the state Sa of the agent 1 resulting from the action a. The transmitting/receiving unit 16 may be wired or wireless, and handles the state change step S2. The reward r obtained in the reward calculation step S32 is stored in the memory 14, for example via the transmitting/receiving unit 16.

 Because the action determination device 10 according to the first embodiment operates according to the action determination method described above, it can likewise respond to changes in the real environment 2.
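The division of roles among the units of FIG. 7 might be mirrored in code roughly as follows. The class and method names, the epsilon-style use of the calculation parameter, and the plain list standing in for the real environment 2 are all illustrative assumptions, not the embodiment's implementation.

```python
import random

class ActionDeterminationDevice:
    """Sketch of FIG. 7: learning unit (11), first control unit (12),
    second control unit (13), and memory (14). The measurement and
    transmitting/receiving units (15, 16) are omitted for brevity."""

    def __init__(self, n_actions, seed=0):
        self.rng = random.Random(seed)
        self.memory = []                    # 14: stores (action, state) pairs
        self.policy = [0.0] * n_actions     # model representing policy pi

    def decide_action(self, calc_param):
        # 12: first control unit — the calculation parameter sets how often
        # the "optimization" returns a random action instead of the best one.
        if self.rng.random() < min(1.0, calc_param):
            return self.rng.randrange(len(self.policy))
        return max(range(len(self.policy)), key=self.policy.__getitem__)

    def act(self, action, env):
        # 13: second control unit — input the action into the environment
        # and record the resulting state in the memory.
        state = env[action]
        self.memory.append((action, state))
        return state

    def learn(self, action, reward, lr=0.5):
        # 11: learning unit — decide the next policy model from the reward.
        self.policy[action] += lr * (reward - self.policy[action])
```

Driving the device for a number of steps — decide, act, learn — with a moderate calculation parameter steers the policy toward the higher-reward action while the memory accumulates the history used for learning.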

 Embodiments of the present invention have been described above in detail with reference to the drawings, but the configurations in each embodiment and their combinations are merely examples; configurations may be added, omitted, replaced, or otherwise modified without departing from the spirit of the present invention.

REFERENCE SIGNS LIST

1 Agent
2 Real environment
10 Action determination device
11 Learning unit
12 First control unit
13 Second control unit
14 Memory
15 Measurement unit
16 Transmitting/receiving unit
a Action
r Reward
π Policy
Sa State
S1 Action determination step
S11 Environmental change measurement step
S12 Calculation parameter setting step
S13 Optimization calculation step
S2 State change step
S21 Action step
S22 Operation step
S3 Learning step
S31 Real environment measurement step
S32 Reward calculation step
S33 Policy decision step

Claims (5)

1. An action determination method comprising:
an action determination step of executing an optimization calculation based on a policy represented by a model applicable to the optimization calculation, and determining an action of an agent;
a state change step of inputting the action into a real environment, whereby the real environment and the agent interact and a state of the agent changes; and
a learning step of acquiring a value of the state of the agent after the action as a reward and determining a next policy based on the reward,
wherein a randomness of the optimization calculation is changeable by a calculation parameter, and
the calculation parameter is changed based on an amount of change in the real environment between the learning step and the action determination step.
2. The action determination method according to claim 1, wherein the optimization calculation is an adiabatic quantum calculation performed on a quantum computer.

3. The action determination method according to claim 1, wherein the optimization calculation is quantum annealing, and the model is an Ising model or a QUBO model.
4. The action determination method according to claim 1, wherein the optimization calculation is performed using a genetic algorithm.

5. An action determination device comprising a learning unit, a first control unit, and a second control unit, wherein
the first control unit executes an optimization calculation based on a policy represented by a model applicable to the optimization calculation, and determines an action of an agent;
the second control unit inputs the action into a real environment and changes a state of the agent;
the learning unit acquires a state of the agent after the action as a reward and determines a next policy based on the reward;
a randomness of the optimization calculation is changeable by a calculation parameter; and
the calculation parameter is changed based on an amount of change in the real environment between when the action of the agent is determined and when the policy is determined.


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2386987A1 (en) * 2010-04-20 2011-11-16 Alcatel Lucent A method of reinforcement learning, corresponding computer program product, and data storage device therefor
JP2019005809A (en) * 2017-06-20 2019-01-17 リンカーン グローバル,インコーポレイテッド Machine learning for weldment classification and correlation
JP2022522180A (en) * 2020-01-10 2022-04-14 テンセント・テクノロジー・(シェンジェン)・カンパニー・リミテッド Insulation development path prediction methods, equipment, equipment and computer programs


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MIZOUE HIROYUKI; KOBAYASHI KUNIKAZU; KUREMOTO TAKASHI; OBAYASHI MASANAO: "A Meta-Parameter Learning Method in Reinforcement Learning Based on Temporal Difference Error", IEEJ Transactions on Electronics, Information and Systems, vol. 129, no. 9, 2009, pp. 1730-1736, ISSN 0385-4221, DOI: 10.1541/ieejeiss.129.1730 *


Legal Events

121: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 23935281; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)