
WO2025004859A1 - Learning device, control system, input/output device, learning method, and recording medium - Google Patents


Info

Publication number
WO2025004859A1
Authority
WO
WIPO (PCT)
Prior art keywords
value
simulation
reset
learning device
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/JP2024/021691
Other languages
French (fr)
Japanese (ja)
Inventor
Shumpei Kubosawa (窪澤 駿平)
Takashi Onishi (大西 貴士)
Yoshimasa Tsuruoka (鶴岡 慶雅)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
National Institute of Advanced Industrial Science and Technology AIST
Original Assignee
NEC Corp
National Institute of Advanced Industrial Science and Technology AIST
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp and National Institute of Advanced Industrial Science and Technology AIST
Publication of WO2025004859A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/092 Reinforcement learning

Definitions

  • This disclosure relates to a learning device, a control system, an input/output device, a learning method, and a recording medium.
  • One example of the objective of this disclosure is to provide a learning device, a control system, an input/output device, a learning method, and a recording medium that can solve the above-mentioned problems.
  • the learning device includes a policy generation means for generating a policy, which is a decision rule for the behavior of a controlled object, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object at a time step gone back to within an episode representing a learning period, and a behavior decision means for deciding the behavior of the controlled object based on the policy.
  • a policy generation means for generating a policy, which is a decision rule for the behavior of a controlled object, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object at a time step gone back to within an episode representing a learning period
  • a behavior decision means for deciding the behavior of the controlled object based on the policy.
  • a control system includes a learning device and a control device
  • the learning device includes a policy generation means for generating a policy, which is a decision rule for the behavior of the control object, based on a reward value, which is a value indicating an evaluation of the behavior of the control object at a time step gone back to within an episode representing a learning period, and a behavior decision means for deciding the behavior of the control object based on the policy
  • the control device controls the control object based on the policy obtained using the learning device.
  • the input/output device includes a state presenting means for presenting to a user the state of the environment in which the controlled object acts, and a state designation receiving means for receiving a user operation for designating a state in which the simulator that simulates the environment needs to be interrupted.
  • the learning method includes generating a policy, which is a decision rule for the behavior of a controlled object, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object at a time step gone back to within an episode representing a learning period, and determining the behavior of the controlled object based on the policy.
  • the recording medium stores a program for causing a computer to execute the following: generating a policy, which is a decision rule for the behavior of a controlled object, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object at a time step gone back to within an episode representing a learning period; and determining the behavior of the controlled object based on the policy.
  • reinforcement learning using simulation can be performed relatively efficiently.
  • FIG. 1 illustrates an example configuration of a control system according to some embodiments of the present disclosure.
  • FIG. 2 is a diagram illustrating an example of the configuration of a learning device according to some embodiments of the present disclosure.
  • FIG. 3 is a diagram showing an example of an instantaneous reward value that is the basis for calculating a cumulative reward value.
  • FIG. 4 is a diagram illustrating an example of a cumulative reward value.
  • FIG. 5 is a diagram showing a first example of a probability distribution set for reset destination candidates according to some embodiments of the present disclosure.
  • FIG. 6 is a diagram showing a second example of a probability distribution set for reset destination candidates according to some embodiments of the present disclosure.
  • FIG. 7 is a diagram showing a third example of a probability distribution set for reset destination candidates according to some embodiments of the present disclosure.
  • FIG. 8 is a diagram illustrating another example of the configuration of a learning device according to some embodiments of the present disclosure.
  • FIG. 9 illustrates another example of a control system configuration according to some embodiments of the present disclosure.
  • FIG. 10 is a diagram illustrating an example of a configuration of an input/output device according to some embodiments of the present disclosure.
  • FIG. 11 is a diagram illustrating an example of a processing procedure in a learning method according to some embodiments of the present disclosure.
  • FIG. 12 illustrates an example of a computer configuration in accordance with at least one embodiment of the present disclosure.
  • FIG. 1 is a diagram illustrating an example of a configuration of a control system according to some embodiments of the present disclosure.
  • the control system 1 includes a learning device 100, a control device 200, and a control target 910.
  • the control system 1 is a system that learns control of a control target 910 and controls the control target 910 based on the learning results.
  • the learning device 100 learns control over the control object 910. In particular, the learning device 100 learns control over the control object 910 by reinforcement learning using a simulation.
  • the reinforcement learning referred to here is a machine learning technique that learns a policy, which is the behavioral rule of an agent that performs an action in a certain environment, based on a reward, which represents an evaluation of the action.
  • the state of the environment is also referred to simply as the state.
  • the environment may include an agent. Therefore, the state may include the state of the agent.
  • the learning device 100 determines an action based on a policy in the state at that time, simulates the determined action, and calculates the next state, which is the state at the next step.
  • the learning device 100 also calculates a reward value based on the obtained next state, and updates the policy based on the calculated reward value.
  • the update of the policy here can be considered as the generation of a policy. In other words, the learning device 100 can be considered to generate a policy based on a past policy.
  • a step in reinforcement learning will be referred to as a time step or simply as a step. Also, below, time will be represented by a time step.
  • the reinforcement learning method used by the learning device 100 is not limited to a specific type of method.
  • the learning device 100 may learn control of the control target 910 based on a known reinforcement learning method such as Q-learning or SARSA.
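As a concrete illustration of the kind of known method mentioned above, a tabular Q-learning update can be sketched as follows. This is an illustrative Python sketch, not code from the disclosure; the table layout and hyperparameter values are assumptions.

```python
from collections import defaultdict

def make_q_table():
    # Q[state][action] -> estimated action value, defaulting to 0.0.
    return defaultdict(lambda: defaultdict(float))

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update:
    Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(q[next_state].values(), default=0.0)
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
```

SARSA would differ only in using the value of the action actually taken in the next state instead of the maximum over actions.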
  • the learning device 100 repeats the processing for each step of each episode until the episode ends.
  • An episode here is a unit of time in reinforcement learning, and is a time period from when an agent starts a series of actions until when it ends.
  • the learning device 100 repeats the processing for each step in the episode until a condition that is predetermined as the condition for ending the episode is met.
  • When the learning device 100 determines that a predetermined condition for interrupting the simulation is satisfied, it interrupts the simulation even in the middle of an episode.
  • the condition for interrupting the simulation is also referred to as an interruption condition.
  • Interrupting the simulation can also be said to interrupt the episode being executed. This allows the learning device 100 to automatically interrupt the execution of an episode when it reaches a state from which no policy update is expected even if the episode is continued, or a state in which the policy update is expected to progress only slowly.
  • Policy updates can be thought of as policy improvements. Policy updates can be thought of as progress in reinforcement learning.
  • the learning device 100 that has suspended the execution of an episode may go back a time step within the episode and resume the execution of the episode. Alternatively, the learning device 100 that has suspended the execution of an episode may start the execution of another episode. According to the learning device 100, reinforcement learning can be performed more efficiently than in a case where the execution of an episode is continued even when the state in which policy updating is not expected or when the state in which policy updating is expected to progress slowly is reached.
  • the learning device 100 also stores the state at each of a number of time steps during the execution of an episode.
  • When the learning device 100 suspends the execution of an episode, it selects one of the time steps for which the state is stored, returns to the selected time step, and resumes the execution of that episode. Specifically, the learning device 100 resumes processing for each time step from the state of the selected time step.
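The snapshot-and-resume behavior described above can be sketched as follows. This is an illustrative Python sketch, not code from the disclosure; all names are hypothetical.

```python
import copy

class EpisodeSnapshots:
    """Stores simulator state snapshots keyed by time step so an
    interrupted episode can be resumed from an earlier step."""

    def __init__(self):
        self._snapshots = {}  # time step -> deep-copied state

    def save(self, step, state):
        # Deep-copy so later simulation steps cannot mutate the snapshot.
        self._snapshots[step] = copy.deepcopy(state)

    def candidates(self):
        # Time steps at which the simulation can be resumed
        # (the reset destination candidates).
        return sorted(self._snapshots)

    def restore(self, step):
        # Return a fresh copy so resumed runs do not corrupt the snapshot.
        return copy.deepcopy(self._snapshots[step])
```

Deep copies are used so that resuming and re-running from a snapshot cannot alter the stored state, which would invalidate later resets to the same step.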
  • In some cases, the beginning of the episode is a part where learning has already progressed sufficiently, and it is conceivable that the policy update would not proceed any further there.
  • the time step to go back in the episode is also called the reset destination.
  • the candidates for the reset destination are also called reset destination candidates.
  • the reset destination candidates are, for example, executed time steps whose states are stored as time steps at which the simulation can be resumed within the episode that was executed until the simulation was interrupted. Interrupting the execution of a simulation and changing the state in the simulation to a reset state is also called resetting the simulation.
  • the control device 200 controls the control target 910 based on the policy obtained by learning using the learning device 100 .
  • the control object 910 is not limited to a specific one, and may be various objects capable of learning control over the control object 910 using reinforcement learning.
  • the control object 910 may be equipment such as a plant or a power plant, a system such as a manufacturing line in a factory, or a standalone device.
  • the control object 910 may be a moving object such as an automobile, a railroad vehicle, an airplane, a ship, or a self-propelled mobile robot, or a transportation system such as a railroad or air traffic control system.
  • the controlled object 910 may be configured as a part of the control system 1 , or may be configured external to the control system 1 .
  • During learning, it is sufficient for the control system 1 to have the learning device 100; the control device 200 and the control target 910 are not necessary. Conversely, during execution of control, it is sufficient to have the control device 200 and the control target 910, and the learning device 100 is not necessary. The learning device 100, the control device 200, and the control target 910 may all be configured integrally, or any two of them may be configured integrally.
  • the learning device 100 and the control device 200 may be configured as a single device.
  • the control device 200 and the control target 910 may be configured as a single device.
  • FIG. 2 is a diagram showing an example of the configuration of the learning device 100.
  • the learning device 100 includes a communication unit 110, a display unit 120, an operation input unit 130, a storage unit 180, and a processing unit 190.
  • the processing unit 190 includes an action decision unit 191, a simulation unit 192, a policy generation unit 193, a reset decision unit 194, and a restart state determination unit 195.
  • the communication unit 110 communicates with other devices. For example, the communication unit 110 may receive various information for performing a simulation from other devices. The communication unit 110 may also transmit the policy obtained by learning to the control device 200.
  • the display unit 120 has a display screen, such as a liquid crystal panel or an LED (Light Emitting Diode) panel, and displays various images.
  • the display unit 120 may display information related to the learning performed by the learning device 100, such as the progress of the learning performed by the learning device 100.
  • the operation input unit 130 includes input devices such as a keyboard and a mouse, and receives user operations. For example, the operation input unit 130 may receive a user operation instructing the learning device 100 to start learning control over the control target 910.
  • the storage unit 180 stores various data. For example, it may store snapshots of states in a simulation.
  • the storage unit 180 is configured using a storage device included in the learning device 100.
  • the processing unit 190 performs various processes by controlling each unit of the learning device 100.
  • the functions of the processing unit 190 are performed, for example, by a CPU (Central Processing Unit) included in the learning device 100 reading a program from the storage unit 180 and executing it.
  • the action decision unit 191 decides the action of the control target 910 in reinforcement learning.
  • the action decision unit 191 decides the action of the control target 910 based on a policy.
  • the action decision unit 191 corresponds to an example of an action decision means.
  • the simulation unit 192 performs a simulation of the environment.
  • the simulation unit 192 performs a simulation of the action determined by the action decision unit 191, and calculates a next state, which is the state after the action.
  • the control object 910 in the simulation can be regarded as an agent or a part thereof, and the control object 910 may be included in the environment that is the target of the simulation.
  • the state of the control object 910 may be included in the state in the simulation.
  • the simulation unit 192 corresponds to an example of a simulation means.
  • the simulation unit 192 may be configured as a part of the learning device 100 , or may be configured external to the learning device 100 .
  • the policy generation unit 193 updates the policy based on the reward value.
  • the reward value is a value indicating an evaluation for an action.
  • the policy is a decision rule for an action.
  • the update of the policy here can be considered as the generation of the policy.
  • the policy generation unit 193 may update the policy based on the instantaneous reward value, or may update the policy based on the cumulative reward value.
  • the instantaneous reward value is a value calculated for each step based on the state, action, and next state, or a part of these, and indicates an evaluation of the action in that step.
  • the cumulative reward value is a value calculated by accumulating the instantaneous reward values for multiple steps.
  • the instantaneous reward value may be multiplied by a coefficient value such as a forgetting coefficient.
  • the policy generation unit 193 may update the policy based on a value function value (also referred to as a value), which is a prediction of the cumulative reward value, e.g., an expected value of the cumulative reward value at the end of an episode.
  • the instantaneous reward value, the cumulative reward value, and the value function value are all examples of reward values.
  • the learning device 100 may use a reward value in which the larger the value, the better the evaluation, or may use a reward value in which the smaller the value, the better the evaluation. In the following, an example will be described in which the learning device 100 uses a reward value in which the larger the value, the better the evaluation.
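The relationship between the instantaneous and cumulative reward values described above can be sketched as follows. This is an illustrative Python sketch, not code from the disclosure; the parameter gamma stands in for the forgetting coefficient mentioned above, and its value is an assumption.

```python
def cumulative_reward(instant_rewards, gamma=1.0):
    """Accumulates per-step instantaneous reward values into the
    cumulative reward value at each time step. gamma plays the role
    of a forgetting (discount) coefficient applied to each term."""
    total, cumulative = 0.0, []
    for t, r in enumerate(instant_rewards):
        total += (gamma ** t) * r
        cumulative.append(total)
    return cumulative
```

With gamma set to 1.0 this is a plain running sum, matching the cumulative reward value plotted against time steps in FIG. 4.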
  • the combination of the action decision unit 191 and the policy generation unit 193 is also called an agent processing unit.
  • When the reset determination unit 194 determines that the interruption condition is satisfied, it interrupts the simulation by the simulation unit 192. Specifically, the reset determination unit 194 stops the simulation being performed by the simulation unit 192.
  • the interruption condition is a condition that is determined in advance as a condition for interrupting the simulation of the behavior of the control target 910.
  • the reset determination unit 194 corresponds to an example of a reset determination means.
  • the reset decision unit 194 can automatically interrupt the execution of an episode if the episode falls into a state where policy updates are not expected to occur even if the episode continues to be executed, or if the episode falls into a state where policy updates are expected to proceed slowly.
  • the reset determination unit 194 that has suspended the execution of an episode may go back a time step within the episode and restart the simulation by the simulation unit 192 from the state at the time step gone back to.
  • the reset determination unit 194 that has suspended the execution of an episode may select another episode and cause the simulation unit 192 to perform a simulation of the selected episode. This allows the learning device 100 to perform reinforcement learning more efficiently compared to a case in which the learning device 100 continues to execute the episode even when the learning device 100 falls into a state in which policy updating is not expected or when the learning device 100 falls into a state in which policy updating is expected to progress slowly.
  • the restart state determination unit 195 determines a reset destination when the simulation by the simulation unit 192 is interrupted.
  • the simulation unit 192 restarts the simulation from the reset destination state determined by the restart state determination unit 195.
  • the restart state determination unit 195 corresponds to an example of a restart state determination means.
  • the restart state determination unit 195 can select not only the beginning of an episode but also any time step in the middle as the destination to go back in the episode, and this is expected to enable the learning device 100 to perform reinforcement learning more efficiently.
  • the simulation unit 192 may resume the simulation from a predetermined state, such as the initial state of the episode.
  • the learning device 100 may be configured without the restart state determination unit 195.
  • the reset decision unit 194 may decide whether or not to interrupt the simulation by the simulation unit 192 based on a change in the cumulative reward value.
  • FIG. 3 is a diagram showing an example of an instantaneous reward value that is the basis for calculating a cumulative reward value.
  • the horizontal axis of the graph in FIG. 3 indicates a time step, and the vertical axis indicates an instantaneous reward value.
  • the instantaneous reward value can be a positive value or a negative value. The larger the instantaneous reward value (e.g., the larger its magnitude when it is a positive value), the better the evaluation it indicates. Conversely, the smaller the instantaneous reward value (e.g., the larger its magnitude when it is a negative value), the worse the evaluation it indicates.
  • Fig. 4 is a diagram showing an example of a cumulative reward value.
  • the horizontal axis of the graph in Fig. 4 indicates a time step, and the vertical axis indicates a cumulative reward value.
  • Fig. 4 shows the cumulative reward value calculated by accumulating the instantaneous reward values shown in Fig. 3 from the start of the episode. The cumulative reward value reaches a maximum at time step t11, and from the time step after t11, the cumulative reward value continues to decrease.
  • When the cumulative reward value is small, such as in the state at time step t12, it is possible that an event has occurred that causes the evaluation indicated by the reward value to be poor.
  • For example, when the control target 910 is a railway, it is possible that the intervals between trains have become too close, causing a disruption to the train schedule.
  • the reset decision unit 194 may decide whether or not to interrupt the simulation by the simulation unit 192 based on the change in the accumulated reward value. For example, when the reset decision unit 194 determines that the evaluation indicated by the cumulative reward value has deteriorated below a predetermined threshold, the reset decision unit 194 may decide to interrupt the simulation by the simulation unit 192. In the example of Fig. 4, the reset decision unit 194 may compare the magnitude of the decrease amount from the maximum value of the cumulative reward value at the time step t11 with the threshold value d11, and may decide to interrupt the simulation by the simulation unit 192 at or after the end of the time step t12 where the magnitude of the decrease amount is greater than the threshold value d11.
  • the reset decision unit 194 may update the threshold value so that the extent of deterioration in evaluation indicated by the threshold value of the cumulative reward value increases as the simulation by the simulation unit 192 progresses.
  • the reset decision unit 194 may update the value of the threshold value d11 so that the value of the threshold value d11 increases as the time steps progress.
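The growing threshold d11 described above can be sketched as follows, under the assumption of simple linear growth; the disclosure does not specify a particular schedule, so the function name and parameter values are illustrative.

```python
def drop_threshold(step, base=1.0, growth=0.01):
    """Interruption threshold d11 that grows as time steps progress,
    so larger drops in the cumulative reward value are tolerated
    later in the simulation (a linear schedule, as an example)."""
    return base + growth * step
```

Any monotonically increasing schedule would serve the same purpose: interrupting aggressively early in training and more leniently as the simulation progresses.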
  • the starting point from which the reset determination unit 194 calculates the amount of decrease in the cumulative reward value is not limited to the maximum value of the cumulative reward value.
  • the reset determination unit 194 may calculate the amount of decrease from a cumulative reward value of 0 and compare it with a threshold value.
  • the reset determination unit 194 may calculate the amount of decrease from the maximum value of the cumulative reward value and compare it with a threshold value.
  • By the reset decision unit 194 setting the threshold to a relatively small value and suspending the execution of the episode earlier, it is expected that the action decision unit 191 will try various actions and find an action (or series of actions) that yields a large reward value at a relatively early stage.
  • the reset decision unit 194 may decide to interrupt the simulation by the simulation unit 192 when the deterioration of the evaluation indicated by the cumulative reward value continues for a predetermined number of time steps or more, based on the increase or decrease in the cumulative reward value. In the example of FIG. 4, if the number of time steps for interrupting the simulation by the simulation unit 192 is set to 7, the reset decision unit 194 may decide to interrupt the simulation by the simulation unit 192 at or after the end of time step t12 in which the cumulative reward value has decreased seven times consecutively from time step t11 at which the cumulative reward value is maximized.
  • the reset decision unit 194 may decide whether or not to interrupt the simulation by the simulation unit 192 based on a combination of the amount of decrease in the cumulative reward value and the number of time steps in which the cumulative reward value has continuously decreased. For example, the reset decision unit 194 may decide to interrupt the simulation by the simulation unit 192 when the amount of decrease in the cumulative reward value becomes greater than a predetermined threshold value, and when the number of time steps in which the cumulative reward value has continuously decreased reaches or exceeds a predetermined number.
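The interruption criteria above (a drop from the maximum cumulative reward exceeding a threshold, and a run of consecutive decreases) can be sketched as follows. This is an illustrative Python sketch, not code from the disclosure; here the two criteria are combined with OR, though as noted above they may also be combined with AND.

```python
def should_interrupt(cumulative, drop_threshold, max_consecutive_drops):
    """Decides whether to interrupt the episode based on the cumulative
    reward values observed so far: True when the latest value has fallen
    below the running maximum by more than drop_threshold, or when the
    value has decreased for max_consecutive_drops consecutive steps."""
    if len(cumulative) < 2:
        return False
    # Criterion 1: magnitude of the decrease from the maximum (d11 in FIG. 4).
    if max(cumulative) - cumulative[-1] > drop_threshold:
        return True
    # Criterion 2: length of the current run of consecutive decreases.
    consecutive = 0
    for prev, cur in zip(cumulative, cumulative[1:]):
        consecutive = consecutive + 1 if cur < prev else 0
    return consecutive >= max_consecutive_drops
```

In the FIG. 4 example, the first criterion fires once the cumulative reward has fallen more than d11 below its value at t11, and the second fires after the stated number of consecutive decreasing steps following t11.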
  • the reset decision unit 194 may decide whether or not to suspend the simulation by the simulation unit 192 based on the value function value in addition to or instead of the cumulative reward value.
  • the value function value is a predicted value of the cumulative reward value, and it is considered that there is a positive correlation between the value function value and the cumulative reward value.
  • When the reset decision unit 194 determines, based on an increase or decrease in the value function value, that the evaluation indicated by the value function value has deteriorated beyond a predetermined threshold, it may interrupt the simulation by the simulation unit 192. In this case, as with the cumulative reward value described above, the reset determination unit 194 may update the threshold so that the extent of the deterioration in evaluation indicated by the threshold of the value function value increases as the simulation by the simulation unit 192 progresses.
  • the reset decision unit 194 may decide to interrupt the simulation by the simulation unit 192 if the deterioration of the evaluation indicated by the value function value continues for a predetermined number of time steps or more, based on an increase or decrease in the value function value.
  • the reset determination unit 194 may determine whether or not to interrupt the simulation by the simulation unit 192 based on a combination of the amount of decrease in the value function value and the number of time steps in which the value function value has continuously decreased.
  • When the current state is similar to a state registered in advance as requiring interruption, the simulation unit 192 may be configured to interrupt the simulation. For example, when the control target 910 is a railway, the storage unit 180 stores, as states requiring interruption, a plurality of patterns of states in which train intervals are short and disruption of the schedule occurs.
  • the reset decision unit 194 then compares the current state with each of the states requiring interruption and determines whether the current state is similar to any of the states requiring interruption. If it is determined that the current state is similar to any of the states requiring interruption, the reset decision unit 194 decides to interrupt the simulation by the simulation unit 192.
  • the method by which the reset determination unit 194 determines whether the current state is similar to the state requiring interruption is not limited to a specific method.
  • the reset determination unit 194 may calculate the feature amount of each of the two states as a vector, and calculate the similarity of the vectors, such as cosine similarity. The reset determination unit 194 may then compare the calculated similarity with a threshold, and determine that the two states are similar if the similarity is equal to or greater than the threshold.
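The cosine-similarity comparison described above can be sketched as follows. This is an illustrative Python sketch, not code from the disclosure; how state features are extracted into vectors, and the threshold value, are assumptions.

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two state feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def requires_interruption(state_vec, stored_states, threshold=0.9):
    # True when the current state is similar (cosine >= threshold)
    # to any stored pattern of a state requiring interruption.
    return any(cosine_similarity(state_vec, s) >= threshold
               for s in stored_states)
```

Any vector similarity measure could be substituted here; cosine similarity is simply the example the text names.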
  • a machine learning model that determines whether or not the two states are similar may be prepared, and the reset determination unit 194 may use this machine learning model to determine whether or not the current state is similar to the state requiring interruption.
  • the learning device 100 may present the states to the user and accept the user's designation of the states requiring interruption.
  • the reset determination unit 194 may cause some state, such as a current state or a past state in the simulation, to be displayed on the display unit 120.
  • the reset determination unit 194 may then receive a user operation via the operation input unit 130 instructing whether or not to register the displayed state as a state requiring interruption.
  • the combination of the reset determination unit 194 and the display unit 120 corresponds to an example of a state presenting unit.
  • the combination of the reset determination unit 194 and the operation input unit 130 corresponds to an example of a state designation receiving unit.
  • the display unit 120 may display the current train operation status in the simulation in the form of a diagram.
  • When the displayed operation status corresponds to a state requiring interruption, the user may perform a user operation indicating this on the operation input unit 130, and the reset determination unit 194 may detect that the user operation has been performed.
  • the learning device 100 is an example of an input/output device in that it is equipped with a reset decision unit 194, a display unit 120, and an operation input unit 130.
  • the input/output device may be configured as a device separate from the learning device 100.
  • For example, a terminal device of the learning device 100 may have a function of displaying the state according to instructions from the learning device 100, and a function of accepting a user operation specifying a state requiring interruption and notifying the learning device 100 of it.
  • from among the reset destination candidates in the episode that was being executed until the simulation was interrupted, the restart state determination unit 195 may select the candidate corresponding to the most recent time step, before the interruption, at which the change in evaluation indicated by the increase or decrease in the cumulative reward value turned from improvement to deterioration, or a time step prior to that.
  • in other words, the restart state determination unit 195 may select a reset destination candidate corresponding to the time step at which the evaluation was best immediately before the interruption of the simulation, or a time step prior to that.
  • time step t11 corresponds to the time step at which the cumulative reward value is maximized immediately before the simulation is interrupted (time step t12).
  • the restart state determination unit 195 may select a reset destination candidate that corresponds to time step t11 or an earlier time step.
  • After time step t11, the cumulative reward value decreases continuously, and it is possible that an event has occurred that causes the evaluation indicated by the reward value to be poor.
  • By having the restart state determination unit 195 select a reset destination candidate corresponding to the most recent time step at which the evaluation was best when the simulation was interrupted, or an earlier time step, it is expected that the possibility of selecting a time step before the occurrence of an event that causes the reward value to indicate a poor evaluation will increase.
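The selection rule above, going back to the candidate at or before the time step of the most recent best evaluation (t11 in the FIG. 4 example), can be sketched as follows. This is an illustrative Python sketch, not code from the disclosure; the fallback behavior when no candidate precedes the peak is an assumption.

```python
def latest_peak_step(cumulative):
    # Time step at which the cumulative reward value last reached
    # its maximum (t11 in the FIG. 4 example).
    best = max(cumulative)
    return max(i for i, v in enumerate(cumulative) if v == best)

def pick_reset_destination(candidates, cumulative):
    """From the stored reset destination candidates (time steps whose
    states were snapshotted), pick the latest one at or before the
    latest peak of the cumulative reward value."""
    peak = latest_peak_step(cumulative)
    eligible = [c for c in candidates if c <= peak]
    # Fall back to the earliest candidate if none precede the peak.
    return max(eligible) if eligible else min(candidates)
```

Picking the latest eligible candidate keeps the rollback as short as possible while still landing at or before the suspected onset of the bad event.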
  • Alternatively, the resume state determination unit 195 may treat all reset destination candidates within the episode that was being executed until the simulation was interrupted as selection targets and select one of them. As a result, if a decrease in the cumulative reward value does not necessarily indicate the occurrence of an event that causes the reward value to be evaluated poorly, the resume state determination unit 195 may be able to select a later reset destination candidate (one closer to the time the simulation was interrupted).
  • the resume state determination unit 195 may select one of the multiple reset destination candidates in accordance with a probability distribution set for the multiple reset destination candidates.
  • For example, when an event that causes the reward value to be evaluated poorly occurs, it is preferable, from the viewpoint of returning to a time step before the event occurred, to return to as early a time step in the episode as possible (a time step close to the beginning of the episode).
  • On the other hand, from the viewpoint of reducing the number of repetitions of learning at the beginning of the episode, it is preferable to return to as late a time step as possible among the time steps that have been executed in the episode. In this way, there is a trade-off between achieving efficiency in reinforcement learning by avoiding a state in which an event that causes the reward value to be evaluated poorly has occurred, and achieving efficiency in reinforcement learning by reducing the number of repetitions of learning at the beginning of the episode.
  • the restart state determination unit 195 therefore probabilistically selects one of the multiple reset destination candidates. This makes it possible to avoid restarting the execution of the episode from the beginning every time the execution of the episode is interrupted. Furthermore, if an event that would cause the reward value to be evaluated poorly has already occurred in the selected reset destination candidate, it is expected that by interrupting the episode one or more times, it will be possible to select a reset destination candidate that was selected before the event that would cause the reward value to be evaluated poorly occurred.
  • The restart state determination unit 195 may select one of the multiple reset destination candidates in accordance with a uniform probability distribution. This allows the restart state determination unit 195 to select a reset destination candidate even for a control object 910 for which it is unknown approximately how many time steps must be retraced to reach a state before the occurrence of an event that caused the evaluation indicated by the reward value to be poor.
  • FIG. 5 is a diagram showing a first example of a probability distribution set for reset destination candidates, in which the horizontal axis of the graph in Fig. 5 represents time steps and the vertical axis represents probability.
  • the reset determination unit 194 suspends the simulation by the simulation unit 192 at a time step after time step t25, and the five time steps from time step t21 to t25 are reset destination candidates.
  • The restart state determination unit 195 designates three of these five reset destination candidates, time steps t21 to t23, as selectable, and sets a uniform probability distribution over them.
  • the restart state determination unit 195 selects each of the three reset destination candidates with a one-third probability according to the set probability distribution.
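  • The uniform selection of Fig. 5 might look as follows in Python; the time-step numbers and variable names are illustrative only:

```python
import random

# Reset destination candidates stored for the interrupted episode
# (the time-step numbers mirror Fig. 5 and are purely illustrative).
candidates = [21, 22, 23, 24, 25]

# As in Fig. 5, only the three earliest candidates are made selectable,
# and a uniform distribution assigns each a probability of 1/3.
selectable = candidates[:3]
probabilities = {step: 1 / len(selectable) for step in selectable}

# Sampling according to this distribution:
choice = random.choice(selectable)
```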
  • the restart state determination unit 195 may set a probability distribution for the reset destination candidates such that the greater the deterioration in evaluation at the time the simulation is interrupted, as indicated by an increase or decrease in the cumulative reward value, the more likely it is that a reset destination candidate with a large number of time steps between the time the simulation is interrupted and the reset destination candidate will be selected, and one of the reset destination candidates may be selected according to the set probability distribution.
  • FIG. 6 is a diagram showing a second example of a probability distribution set for reset destination candidates, in which the horizontal axis of the graph in Fig. 6 represents time steps and the vertical axis represents probability.
  • the reset determination unit 194 suspends the simulation by the simulation unit 192 at a time step after time step t25, and the five time steps from time step t21 to t25 are reset destination candidates.
  • the restart state determination unit 195 sets three time steps from time step t21 to t23 out of these five reset destination candidates as reset destination candidates to be selected, and sets a probability distribution for these reset destination candidates.
  • FIG. 6 shows an example in which the amount of decrease in the cumulative reward value when the simulation is interrupted is relatively small, and a higher probability is set for a reset destination candidate that is closer to the interruption (i.e., a reset destination candidate with a small number of time steps between it and the time of interruption).
  • The restart state determination unit 195 selects one of the three reset destination candidates according to the set probability distribution. This makes it relatively easy for the restart state determination unit 195 to select a reset destination candidate that is close to the interruption time (i.e., one with a small number of time steps between it and the interruption time).
  • FIG. 7 is a diagram showing a third example of a probability distribution set for reset destination candidates.
  • the horizontal axis of the graph in Fig. 7 represents time steps, and the vertical axis represents probability.
  • the reset determination unit 194 suspends the simulation by the simulation unit 192 at a time step after time step t25, and the five time steps from time step t21 to t25 are reset destination candidates.
  • the resume state determination unit 195 sets three time steps from time step t21 to t23 out of these five reset destination candidates as reset destination candidates to be selected, and sets a probability distribution for these reset destination candidates.
  • Fig. 7 shows an example in which the magnitude of the decrease in the cumulative reward value at the time of interruption of the simulation is relatively large.
  • In Fig. 7, a higher probability than in Fig. 6 is set for time step t21, the reset destination candidate relatively far from the time of interruption.
  • Conversely, a lower probability than in Fig. 6 is set for time step t23, the reset destination candidate relatively close to the time of interruption.
  • the restart state determination unit 195 selects one of the three reset destination candidates according to the set probability distribution. This makes it relatively easy for the restart state determination unit 195 to select a reset destination candidate that is far from the interruption time (i.e., a reset destination candidate with many time steps between the interruption time and the reset destination candidate).
  • the learning device 100 can efficiently perform reinforcement learning by the restart state determination unit 195 changing the reset destination candidates that are easy to select depending on the magnitude of the increase or decrease in the cumulative reward value.
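  • One possible way to realize such a deterioration-dependent distribution is sketched below. The weighting scheme (an exponential skew controlled by the size of the drop) is our own choice for illustration; the embodiment does not fix a particular formula:

```python
import random

def reset_distribution(selectable_steps, drop, scale=1.0):
    """Weight reset destination candidates by how far the cumulative
    reward fell at the interruption ('drop' >= 0).

    A small drop favors candidates close to the interruption (Fig. 6);
    a large drop shifts weight toward earlier candidates (Fig. 7).
    'scale' controls how quickly the weights shift and is our own knob.
    """
    last = max(selectable_steps)
    # Each step's weight grows with its distance from the interruption,
    # and the growth rate increases with the size of the drop.
    weights = [(1.0 + scale * drop) ** (last - s) for s in selectable_steps]
    total = sum(weights)
    return [w / total for w in weights]

def sample_reset_step(selectable_steps, drop, scale=1.0):
    """Draw one reset destination according to the distribution above."""
    probs = reset_distribution(selectable_steps, drop, scale)
    return random.choices(selectable_steps, weights=probs, k=1)[0]
```

  • With no drop, the distribution is uniform, as in Fig. 5; as the drop grows, probability mass shifts toward earlier candidates, as in the transition from Fig. 6 to Fig. 7.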
  • FIG. 8 is a diagram showing an example of data input/output in the learning device 100.
  • the action decision unit 191 decides the action of the control target 910 based on the measure, and outputs the decided action to the simulation unit 192 .
  • the simulation unit 192 simulates the action determined by the action determination unit 191 to calculate a state (next state), and outputs the calculated state to the measure generation unit 193.
  • The policy generation unit 193 calculates a reward value based on the state acquired from the simulation unit 192, and updates the policy based on the calculated reward value. Depending on the reward value, the policy generation unit 193 may leave the policy unchanged. In addition, the policy generation unit 193 outputs the calculated reward value to the reset determination unit 194 and the restart state determination unit 195.
  • the reset decision unit 194 decides whether or not to interrupt the simulation by the simulation unit 192 based on the reward value. If the reset decision unit 194 decides to interrupt the simulation, the restart state decision unit 195 decides the reset destination based on the reward value. The reset decision unit 194 and the restart state decision unit 195 then output an instruction to interrupt the simulation and the reset destination to the simulation unit 192 and the measure generation unit 193.
  • the reward value used by the policy generation unit 193, the reward value used by the reset determination unit 194, and the reward value used by the restart state determination unit 195 are not limited to a specific one.
  • the reward value used by the policy generation unit 193, the reward value used by the reset determination unit 194, and the reward value used by the restart state determination unit 195 may be an instantaneous reward value, a cumulative reward value, a value function value, or any other reward value, or may be a combination of these.
  • the reward value used by the policy generation unit 193, the reward value used by the reset determination unit 194, and the reward value used by the restart state determination unit 195 may be the same or different.
  • the policy generation unit 193 may output an instantaneous reward value to the reset determination unit 194, and the reset determination unit 194 may calculate a cumulative reward value based on the instantaneous reward value.
  • Upon receiving the interruption instruction, the simulation unit 192 interrupts the simulation and returns the state of the simulation to the designated reset destination. Similarly, when an interruption instruction is received, the policy generation unit 193 returns the policy, the reward value, or both to their values at the specified reset destination. Alternatively, the policy generation unit 193 may leave both the policy and the reward value unchanged when the simulation is interrupted. In this case, the reset determination unit 194 and the restart state determination unit 195 need not output the interruption instruction and the reset destination to the policy generation unit 193.
  • FIG. 9 is a diagram illustrating an example of a procedure of processing performed by the learning device 100.
  • the simulation unit 192 sets up a simulation (step S101).
  • In step S101, the simulation unit 192 sets the state in the simulation to the initial state specified for the episode.
  • the processing unit 190 executes one step of reinforcement learning (step S102).
  • the reset determination unit 194 determines whether or not the interruption condition is met (step S103).
  • If the reset determination unit 194 determines that the interruption condition is met (step S103: YES), the restart state determination unit 195 determines the reset destination (step S111), and the process returns to step S101.
  • In step S101 in this case, the simulation unit 192 sets the state in the simulation to the state after the reset.
  • If the reset determination unit 194 determines that the interruption condition is not met (step S103: NO), the processing unit 190 determines whether or not the end condition of the episode is satisfied (step S121).
  • If the processing unit 190 determines that the end condition of the episode is not satisfied (step S121: NO), the process returns to step S102.
  • If the processing unit 190 determines that the end condition of the episode is satisfied (step S121: YES), it determines whether or not the end condition of the reinforcement learning is satisfied (step S131).
  • If the processing unit 190 determines that the end condition of the reinforcement learning is not satisfied (step S131: NO), it selects the next episode (step S141), and the process returns to step S101.
  • In step S101 in this case, the simulation unit 192 sets the state in the simulation to the initial state of the episode selected by the processing unit 190.
  • If the processing unit 190 determines that the end condition of the reinforcement learning is satisfied (step S131: YES), the learning device 100 ends the process of Fig. 9.
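  • The control flow of Fig. 9 can be summarized in the following Python skeleton. The `env` and `agent` objects and all of their methods are hypothetical stand-ins for the simulation unit 192 and the other units of the processing unit 190; they are not names from the embodiment:

```python
def run_reinforcement_learning(env, agent):
    """Skeleton of the processing procedure of Fig. 9 (illustrative)."""
    episode = agent.first_episode()
    env.setup(episode.initial_state())             # step S101
    while True:
        agent.learn_one_step(env)                  # step S102
        if agent.interruption_condition(env):      # step S103: YES
            state = agent.decide_reset_destination()   # step S111
            env.setup(state)                       # back to step S101
            continue
        if not env.episode_ended():                # step S121: NO
            continue                               # back to step S102
        if agent.learning_finished():              # step S131: YES
            break                                  # end of processing
        episode = agent.next_episode()             # step S141
        env.setup(episode.initial_state())         # back to step S101
```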
  • the policy generation unit 193 generates a policy, which is a decision rule for behavior, based on a reward value, which is a value indicating an evaluation of the behavior of the control object 910 at a time step back within an episode representing a learning period.
  • the behavior decision unit 191 decides the behavior of the control object based on the policy.
  • According to the learning device 100, it is expected that reinforcement learning can be performed more efficiently by going back in time steps within an episode.
  • For example, the learning device 100 may be able to avoid the occurrence of a factor that would cause the evaluation indicated by the reward value to deteriorate, by going back to a time step before the one in which that factor occurred. This is expected to enable the learning device 100 to generate a measure that will improve the evaluation of the entire episode more quickly.
  • the reset determination unit 194 suspends the simulation when it is determined that a predetermined condition for suspending the simulation of the behavior of the control target 910 is satisfied.
  • The measure generation unit 193 generates a measure based on a reward value calculated based on the simulation of the behavior of the control target 910. According to the learning device 100, reinforcement learning can be performed more efficiently than in a case where the execution of an episode is continued even after reaching a state in which the policy update is not expected to progress, or is expected to progress only slowly.
  • the simulation unit 192 simulates the behavior of the controlled object 910 and calculates the state of the environment in which the controlled object behaves. According to the learning device 100, it is expected that reinforcement learning can be performed more efficiently by going back in time steps in a simulation.
  • the reset decision unit 194 decides whether or not to interrupt the simulation based on a change in the cumulative reward value, which is the cumulative value of the instantaneous reward value. This allows the reset decision unit 194 to decide whether to suspend the simulation based on not only the state at one time step but also on a change in state. In this respect, the learning device 100 is expected to be able to appropriately decide whether to suspend the simulation.
  • If the reset decision unit 194 determines, based on an increase or decrease in the cumulative reward value, that the evaluation indicated by the cumulative reward value has deteriorated beyond a predetermined threshold, it decides to interrupt the simulation. According to the learning device 100, it is possible to determine whether or not to interrupt the simulation by the simple process of comparing the cumulative reward value with a threshold value.
  • the reset determination unit 194 updates the threshold value of the cumulative reward value so that the extent of deterioration in the evaluation indicated by the threshold value increases as the simulation progresses.
  • the reset decision unit 194 sets the threshold to a relatively small value to hasten the interruption of the execution of the episode, so that the action decision unit 191 will try various actions, and an action (or a series of actions) that will increase the reward value will be found at a relatively early stage.
  • the learning device 100 can efficiently perform reinforcement learning.
  • the reset decision unit 194 decides to interrupt the simulation if the deterioration of the evaluation indicated by the cumulative reward value continues for a predetermined number of time steps or more, based on the increase or decrease in the cumulative reward value. According to the learning device 100, it is possible to determine whether or not to interrupt the simulation by the simple process of counting the number of steps in which the cumulative reward value is continuously decreasing.
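  • The two interruption criteria described above (comparison of the cumulative reward value with a threshold, and counting consecutive decreases) can be sketched together as follows; the function and its parameters are illustrative names only:

```python
def should_interrupt(cumulative_rewards, threshold, max_drop_steps):
    """Decide whether to interrupt the simulation.

    Interrupt when the latest cumulative reward has fallen below
    'threshold', or when it has decreased for 'max_drop_steps' or more
    consecutive time steps.
    """
    # Criterion 1: simple comparison with a threshold.
    if cumulative_rewards[-1] < threshold:
        return True
    # Criterion 2: count how many of the most recent steps were
    # strict decreases of the cumulative reward.
    drops = 0
    for i in range(len(cumulative_rewards) - 1, 0, -1):
        if cumulative_rewards[i] < cumulative_rewards[i - 1]:
            drops += 1
        else:
            break
    return drops >= max_drop_steps
```

  • The threshold could also be tightened as learning progresses, as described above, by passing a smaller value in early episodes and a larger one later.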
  • the reset decision unit 194 decides whether or not to interrupt the simulation based on a value function value that is a predicted value of a cumulative reward value that is a cumulative value of instantaneous reward values. This allows the reset decision unit 194 to decide whether to suspend the simulation based on not only the state at one time step but also on a change in state. In this respect, the learning device 100 is expected to be able to appropriately decide whether to suspend the simulation.
  • If the reset decision unit 194 determines, based on an increase or decrease in the value function value, that the evaluation indicated by the value function value has deteriorated beyond a predetermined threshold value, it decides to interrupt the simulation. According to the learning device 100, it is possible to determine whether or not to interrupt the simulation by the simple process of comparing the value function value with a threshold value.
  • the reset determination unit 194 updates the threshold value so that the extent of deterioration in the evaluation indicated by the threshold value of the value function value increases as the simulation progresses.
  • the reset decision unit 194 sets the threshold to a relatively small value and interrupts the execution of the episode earlier, so that the action decision unit 191 tries various actions and finds an action (or a series of actions) that will increase the reward value at a relatively early point in time.
  • the learning device 100 is expected to enable efficient reinforcement learning.
  • The reset decision unit 194 decides to interrupt the simulation when the deterioration of the evaluation indicated by the value function value continues for a predetermined number of time steps or more, based on the increase or decrease in the value function value. According to the learning device 100, it is possible to determine whether or not to interrupt the simulation by the simple process of counting the number of steps in which the value function value is continuously decreasing.
  • the reset decision unit 194 decides to interrupt the simulation when it determines that the current state of the environment in which the control object 910 is acting, as calculated in the simulation, is similar to one or more states that are pre-set as states requiring interruption to a certain extent or more.
  • the designer who designs learning device 100 only needs to prepare a sample of a state in which a simulation needs to be interrupted, and does not need to design rules for determining whether or not a simulation needs to be interrupted. In this respect, learning device 100 reduces the burden on the designer.
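  • The similarity test described above might be implemented as follows, using Euclidean distance as one possible similarity measure; the embodiment does not fix a particular measure, and the names here are illustrative:

```python
import math

def near_interruption_state(state, interruption_samples, radius):
    """Return True when the current state is within 'radius'
    (Euclidean distance) of any pre-registered sample of a state
    requiring interruption."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return any(dist(state, s) <= radius for s in interruption_samples)
```

  • Under this sketch, the designer only supplies `interruption_samples` and a tolerance `radius`, rather than hand-written interruption rules.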
  • the combination of the reset determination unit 194 and the display unit 120 presents the state to the user.
  • the combination of the reset determination unit 194 and the operation input unit 130 accepts a user operation that specifies a state that requires interruption.
  • a sample of a state in which a simulation needs to be interrupted can be acquired in response to a user's designation.
  • the learning device 100 reduces the burden on the designer of the learning device 100.
  • the user can reflect in the learning device 100 the user's own judgment as to whether or not a simulation needs to be interrupted.
  • the restart state determining unit 195 determines the state when the simulation is restarted. As described above, it is considered that the policy update does not necessarily progress at the beginning of an episode. As described above for the learning device 100, the resumption state determination unit 195 can select not only the beginning of an episode but also an intermediate time step as the destination to go back in the episode, and therefore it is expected that the learning device 100 can perform reinforcement learning more efficiently.
  • the restart state determination unit 195 also selects one of the reset destination candidates based on a probability distribution set for the reset destination candidates, which are executed time steps whose states are stored as time steps at which the simulation can be restarted within the episode that was executed until the simulation was interrupted.
  • By having the restart state determination unit 195 probabilistically select one of the multiple reset destination candidates, it is possible to avoid restarting the execution of the episode from the beginning every time the execution of the episode is interrupted. Furthermore, if an event that would cause the reward value to be evaluated poorly has already occurred at the selected reset destination candidate, it is expected that, by interrupting the episode one or more times, it will become possible to select a reset destination candidate from before that event occurred.
  • The restart state determination unit 195 may select one of the multiple reset destination candidates in accordance with a uniform probability distribution.
  • This allows a reset destination candidate to be selected even for a control object 910 for which it is unknown approximately how many time steps must be retraced to reach a state before the occurrence of an event that caused the evaluation indicated by the reward value to be poor.
  • The restart state determination unit 195 may also set a probability distribution for the multiple reset destination candidates such that the greater the deterioration in evaluation at the time the simulation is interrupted, as indicated by an increase or decrease in the cumulative reward value, the more likely it is that a reset destination candidate with a large number of time steps between it and the time of interruption is selected, and may select one of the reset destination candidates according to the set probability distribution.
  • the learning device 100 can efficiently perform reinforcement learning by the restart state determination unit 195 changing the reset destination candidates that are easy to select depending on the magnitude of the increase or decrease in the cumulative reward value.
  • The restart state determination unit 195 may also select, from among the reset destination candidates, the candidate corresponding to the most recent time step before the interruption at which the change in evaluation, indicated by an increase or decrease in the cumulative reward value, turned from improvement to deterioration, or a candidate corresponding to an earlier time step.
  • After such a turning point, the cumulative reward value decreases continuously, so it is possible that an event has occurred that causes the evaluation indicated by the reward value to be poor.
  • By having the restart state determination unit 195 select a reset destination candidate corresponding to the time step at which the evaluation was last at a maximum before the interruption of the simulation, or an earlier time step, the possibility of selecting a time step from before the occurrence of such an event is expected to increase.
  • a learning device 610 includes a policy generator 611 and an action determiner 612.
  • the policy generating unit 611 generates a policy, which is a decision rule for behavior, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object in a state going back a time step in an episode representing a learning period.
  • the behavior deciding unit 612 decides the behavior of the controlled object based on the policy.
  • the measure generating unit 611 corresponds to an example of a measure generating means
  • the action deciding unit 612 corresponds to an example of an action deciding means.
  • the learning device 610 can perform reinforcement learning more efficiently by going back in time steps within an episode.
  • For example, the learning device 610 may be able to avoid the occurrence of a factor that would cause the evaluation indicated by the reward value to deteriorate, by going back to a time step before the one in which that factor occurred. This is expected to enable the learning device 610 to generate a measure that will improve the evaluation of the entire episode more quickly.
  • the policy generation unit 611 can be realized, for example, by using the functions of the policy generation unit 193 in FIG. 1, etc.
  • the action decision unit 612 can be realized, for example, by using the functions of the action decision unit 191 in FIG. 1, etc.
  • FIG. 11 is a diagram illustrating another example of a control system configuration according to some embodiments of the present disclosure.
  • the control system 620 includes a measure generator 622 and an action determiner 623 .
  • the policy generating unit 622 generates a policy, which is a decision rule for behavior, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object in a state going back a time step in an episode representing a learning period.
  • the behavior deciding unit 623 decides the behavior of the controlled object based on the policy.
  • the measure generating unit 622 corresponds to an example of a measure generating means
  • the action deciding unit 623 corresponds to an example of an action deciding means.
  • control system 620 can perform reinforcement learning more efficiently by going back in time steps within an episode.
  • For example, the control system 620 may be able to avoid the occurrence of a factor that would cause the evaluation indicated by the reward value to deteriorate, by going back to a time step before the one in which that factor occurred. This is expected to enable the control system 620 to generate a measure that would improve the evaluation of the entire episode more quickly.
  • the policy generation unit 622 can be realized, for example, by using the functions of the policy generation unit 193 in FIG. 1, etc.
  • the action decision unit 623 can be realized, for example, by using the functions of the action decision unit 191 in FIG. 1, etc.
  • an input/output device 630 includes a status presenting unit 631 and a status designation receiving unit 632.
  • the state presenting unit 631 presents to the user the state of the environment in which the controlled object acts.
  • the state designation receiving unit 632 receives a user operation that designates a state in which the simulator that simulates the environment needs to be interrupted.
  • The status presenting unit 631 corresponds to an example of a status presenting means,
  • and the status designation receiving unit 632 corresponds to an example of a status designation receiving means.
  • According to the input/output device 630, a sample of a state in which a simulation needs to be interrupted can be acquired in response to a user's designation.
  • the input/output device 630 reduces the burden on the designer of the input/output device 630.
  • the user can reflect his/her own judgment as to whether or not a simulation needs to be interrupted in the execution of the simulation.
  • the state presenting unit 631 can be realized, for example, by using the functions of the reset determining unit 194 and the display unit 120 in Fig. 1.
  • the state designation receiving unit 632 can be realized, for example, by using the functions of the reset determining unit 194 and the operation input unit 130 in Fig. 1.
  • FIG. 13 is a diagram showing an example of a processing procedure in a learning method according to some embodiments of the present disclosure.
  • the learning method shown in FIG. 13 includes generating a strategy (step S611) and determining an action (step S612).
  • step S611 the computer generates a policy, which is a decision rule for the behavior, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object at a time step back within the episode representing the learning period.
  • step S612 the computer determines an action of the controlled object based on a strategy.
  • According to the learning method shown in FIG. 13, it is expected that reinforcement learning can be performed more efficiently by going back a time step within an episode. For example, according to the learning method shown in Fig. 13, it is possible to avoid the occurrence of a factor that deteriorates the evaluation indicated by the reward value by going back to a time step before the one in which that factor occurred. As a result, it is expected that the learning method shown in Fig. 13 can generate a policy that improves the evaluation of the entire episode more quickly.
  • FIG. 14 is a diagram illustrating an example of a computer configuration in accordance with at least some embodiments of the present disclosure.
  • a computer 700 includes a CPU 710 , a main memory device 720 , an auxiliary memory device 730 , an interface 740 , and a non-volatile recording medium 750 .
  • any one or more of the learning device 100, the control device 200, the learning device 610, the learning device 621, the control device 626, and the input/output device 630, or a part of them, may be implemented in the computer 700.
  • the operation of each of the above-mentioned processing units is stored in the auxiliary storage device 730 in the form of a program.
  • the CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.
  • the CPU 710 also secures a storage area corresponding to each of the above-mentioned storage units in the main storage device 720 according to the program.
  • Communication with other devices is achieved by the interface 740 having a communication function and communicating according to the control of the CPU 710.
  • the interface 740 also has a port for the non-volatile recording medium 750, and reads information from the non-volatile recording medium 750 and writes information to the non-volatile recording medium 750.
  • the operations of the processing unit 190 and each of its units are stored in the auxiliary storage device 730 in the form of a program.
  • the CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.
  • the CPU 710 also reserves a memory area for the memory unit 180 in the main memory device 720 in accordance with the program. Communication with other devices by the communication unit 110 is achieved by the interface 740 having a communication function and operating under the control of the CPU 710. Display of images by the display unit 120 is achieved by the interface 740 having a display device and displaying various images under the control of the CPU 710. Reception of user operations by the operation input unit 130 is achieved by the interface 740 having an input device and accepting user operations under the control of the CPU 710.
  • When the control device 200 is implemented in the computer 700, its operation is stored in the form of a program in the auxiliary storage device 730.
  • the CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.
  • the CPU 710 also allocates a memory area in the main memory device 720 for the control device 200 to perform processing according to the program. Communication between the control device 200 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710. Interaction between the control device 200 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.
  • the operations of the policy generation unit 611 and the action decision unit 612 are stored in the auxiliary storage device 730 in the form of a program.
  • the CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.
  • the CPU 710 also allocates a memory area in the main memory device 720 for the learning device 610 to perform processing according to the program. Communication between the learning device 610 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710. Interaction between the learning device 610 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.
  • the operations of the policy generation unit 622 and the action decision unit 623 are stored in the auxiliary storage device 730 in the form of a program.
  • the CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.
  • the CPU 710 also allocates a memory area in the main memory device 720 for the learning device 621 to perform processing according to the program.
  • Communication between the learning device 621 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710.
  • Interaction between the learning device 621 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.
  • When the control device 626 is implemented in the computer 700, its operation is stored in the auxiliary storage device 730 in the form of a program.
  • the CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.
  • the CPU 710 also allocates a memory area in the main memory 720 for the control device 626 to perform processing according to the program. Communication between the control device 626 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710. Interaction between the control device 626 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.
  • the operations of the state presentation unit 631 and the state designation reception unit 632 are stored in the auxiliary storage device 730 in the form of a program.
  • the CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.
  • the CPU 710 also allocates a storage area in the main memory device 720 for the I/O device 630 to perform processing according to the program.
  • Communication between the I/O device 630 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710.
  • Interaction between the I/O device 630 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.
  • any one or more of the above-mentioned programs may be recorded on the non-volatile recording medium 750.
  • the interface 740 may read the program from the non-volatile recording medium 750.
  • the CPU 710 may then directly execute the program read by the interface 740, or may temporarily store the program in the main memory device 720 or the auxiliary memory device 730 and then execute it.
  • a program for executing all or part of the processing performed by learning device 100, control device 200, learning device 610, learning device 621, control device 626, and input/output device 630 may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read into a computer system and executed to perform processing of each part.
  • The term "computer system" here includes an OS (Operating System) and hardware such as peripheral devices.
  • The term "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs (Read Only Memory), and CD-ROMs (Compact Disc Read Only Memory), as well as storage devices such as hard disks built into computer systems.
  • the above-mentioned program may be for realizing part of the above-mentioned functions, or may be capable of realizing the above-mentioned functions in combination with a program already recorded in the computer system.
  • a learning device comprising:
  • (Appendix 2) a reset decision means for suspending the simulation when it is determined that a predetermined condition for suspending the simulation of the behavior of the controlled object is satisfied,
  • the policy generating means generates the policy based on the reward value calculated based on the simulation.
  • (Appendix 3) The learning device further comprising a simulation means for performing the simulation to calculate a state of an environment in which the controlled object acts.
  • (Appendix 4) The learning device according to appendix 2 or 3, wherein the reset decision means decides whether or not to interrupt the simulation based on a change in a cumulative reward value, which is a cumulative value of instantaneous reward values.
  • (Appendix 5) The learning device according to appendix 4, wherein the reset decision means decides to interrupt the simulation when it determines, based on the increase or decrease of the cumulative reward value, that the evaluation indicated by the cumulative reward value has deteriorated below a predetermined threshold.
  • (Appendix 6) The learning device according to appendix 5, wherein the reset decision means updates the threshold value so that the range of deterioration in the evaluation indicated by the threshold value of the cumulative reward value increases as the simulation progresses.
  • (Appendix 7) The learning device according to any one of appendices 4 to 6, wherein the reset decision means decides to interrupt the simulation when, based on an increase or decrease in the cumulative reward value, a deterioration in the evaluation indicated by the cumulative reward value continues for a predetermined number of time steps or more.
  • (Appendix 8) The reset decision means determines whether or not to interrupt the simulation based on a value function value that is a predicted value of the cumulative reward value, which is a cumulative value of instantaneous reward values.
  • (Appendix 9) The reset decision means decides to interrupt the simulation when it determines, based on an increase or decrease in the value function value, that the evaluation indicated by the value function value has deteriorated below a predetermined threshold.
  • (Appendix 10) The reset decision means updates the threshold value so that the range of deterioration in the evaluation indicated by the threshold value of the value function value increases as the simulation progresses.
  • (Appendix 11) The learning device according to appendix 9 or 10, wherein the reset decision means decides to interrupt the simulation when, based on an increase or decrease in the value function value, a deterioration in the evaluation indicated by the value function value continues for a predetermined number of time steps or more.
  • (Appendix 12) The learning device according to any one of appendices 2 to 11, wherein the reset decision means decides to interrupt the simulation when it determines that a current state of the environment in which the control target acts, calculated in the simulation, is similar, to a certain degree or more, to any one or more states preset as states requiring interruption.
  • (Appendix 15) The learning device according to appendix 14, wherein the restart state determination means selects one of the reset destination candidates based on a probability distribution set for the reset destination candidates, which are executed time steps whose states are stored as time steps from which the simulation can be restarted within the episode that was executed until the simulation was interrupted.
  • (Appendix 16) The restart state determination means selects one of the plurality of reset destination candidates in accordance with a uniform probability distribution.
  • (Appendix 17) The restart state determination means sets a probability distribution for the plurality of reset destination candidates such that the greater the deterioration of the evaluation at the time of interruption of the simulation, as indicated by an increase or decrease in the cumulative reward value, which is a cumulative value of instantaneous reward values, the more likely a reset destination candidate separated from the time of interruption by a larger number of time steps is to be selected, and selects one of the reset destination candidates according to the set probability distribution.
  • (Appendix 18) The learning device according to any one of appendices 14 to 17, wherein the restart state determination means selects, from among the reset destination candidates, which are executed time steps whose states are stored as time steps from which the simulation can be restarted within the episode that was executed until the simulation was interrupted, the reset destination candidate corresponding to the time step, among those at which the change in the evaluation indicated by the increase or decrease in the cumulative reward value turned from improvement to deterioration, that is separated from the time of interruption by the smallest number of time steps, or a time step earlier than that.
  • (Appendix 19) A control device that performs control on a control target based on a policy obtained using the learning device according to any one of appendices 1 to 18.
  • (Appendix 20) A control system comprising a learning device and a control device, wherein the learning device includes: a policy generating means for generating a policy, which is a decision rule for an action, based on a reward value, which is a value indicating an evaluation of the action of the control target in a state going back a time step within an episode representing a learning period; and an action decision means for deciding an action of the control target based on the policy; and wherein the control device controls the control target based on the policy obtained using the learning device.
  • (Appendix 21) An input/output device comprising: a state presentation means for presenting to a user a state of an environment in which the control target acts; and a state designation receiving means for receiving a user operation for designating a state in which the simulator that simulates the environment needs to be interrupted.
  • a learning method that includes:
  • This disclosure may be applied to a learning device, a control system, an input/output device, a learning method, and a recording medium.
  • 1 Control system; 100, 610, 621 Learning device; 110 Communication unit; 120 Display unit; 130 Operation input unit; 180 Storage unit; 190 Processing unit; 191, 612, 623 Action decision unit; 192 Simulation unit; 193, 611, 622 Policy generation unit; 194 Reset decision unit; 195 Resume state decision unit; 200, 624 Control device; 630 Input/output device; 631 State presentation unit; 632 State designation reception unit; 910 Control target

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Feedback Control In General (AREA)

Abstract

In the present invention, a learning device generates a policy, which is a rule for deciding an action of a control target, on the basis of a reward value indicating an evaluation of the action in a state reached by going back a time step within an episode representing a learning period, and decides the action of the control target on the basis of the policy.

Description

Learning device, control system, input/output device, learning method, and recording medium

This disclosure relates to a learning device, a control system, an input/output device, a learning method, and a recording medium.

One type of machine learning is reinforcement learning (see, for example, Patent Document 1).

Japanese Patent Application Publication No. 2022-182581

It would be preferable to be able to efficiently perform reinforcement learning using simulation.

One example of the objective of this disclosure is to provide a learning device, a control system, an input/output device, a learning method, and a recording medium that can solve the above-mentioned problem.

According to a first aspect of the present disclosure, a learning device includes a policy generation means for generating a policy, which is a decision rule for the action of a control target, based on a reward value, which is a value indicating an evaluation of the action of the control target in a state reached by going back a time step within an episode representing a learning period, and an action decision means for deciding the action of the control target based on the policy.

According to a second aspect of the present disclosure, a control system includes a learning device and a control device, the learning device including a policy generation means for generating a policy, which is a decision rule for the action of a control target, based on a reward value, which is a value indicating an evaluation of the action of the control target in a state reached by going back a time step within an episode representing a learning period, and an action decision means for deciding the action of the control target based on the policy, and the control device controls the control target based on the policy obtained using the learning device.

According to a third aspect of the present disclosure, an input/output device includes a state presenting means for presenting to a user the state of an environment in which a control target acts, and a state designation receiving means for receiving a user operation for designating a state in which the simulator that simulates the environment needs to be interrupted.

According to a fourth aspect of the present disclosure, a learning method includes generating a policy, which is a decision rule for the action of a control target, based on a reward value, which is a value indicating an evaluation of the action of the control target in a state reached by going back a time step within an episode representing a learning period, and deciding the action of the control target based on the policy.

According to a fifth aspect of the present disclosure, a recording medium stores a program for causing a computer to execute: generating a policy, which is a decision rule for the action of a control target, based on a reward value, which is a value indicating an evaluation of the action of the control target in a state reached by going back a time step within an episode representing a learning period; and deciding the action of the control target based on the policy.

According to the present disclosure, reinforcement learning using simulation can be performed relatively efficiently.

FIG. 1 is a diagram illustrating an example of the configuration of a control system according to some embodiments of the present disclosure.
FIG. 2 is a diagram illustrating an example of the configuration of a learning device according to some embodiments of the present disclosure.
FIG. 3 is a diagram showing an example of instantaneous reward values that are the basis for calculating a cumulative reward value.
FIG. 4 is a diagram illustrating an example of a cumulative reward value.
FIG. 5 is a diagram showing a first example of a probability distribution set for reset destination candidates according to some embodiments of the present disclosure.
FIG. 6 is a diagram showing a second example of a probability distribution set for reset destination candidates according to some embodiments of the present disclosure.
FIG. 7 is a diagram showing a third example of a probability distribution set for reset destination candidates according to some embodiments of the present disclosure.
FIG. 8 is a diagram showing an example of data input and output in a learning device according to some embodiments of the present disclosure.
FIG. 9 is a diagram showing an example of a processing procedure performed by a learning device according to some embodiments of the present disclosure.
FIG. 10 is a diagram illustrating another example of the configuration of a learning device according to some embodiments of the present disclosure.
FIG. 11 is a diagram illustrating another example of the configuration of a control system according to some embodiments of the present disclosure.
FIG. 12 is a diagram illustrating an example of the configuration of an input/output device according to some embodiments of the present disclosure.
FIG. 13 is a diagram illustrating an example of a processing procedure in a learning method according to some embodiments of the present disclosure.
FIG. 14 is a diagram illustrating an example of the configuration of a computer according to at least one embodiment of the present disclosure.

Hereinafter, embodiments of the present disclosure will be described, but the following embodiments do not limit the disclosure according to the claims. Furthermore, not all of the combinations of features described in the embodiments are necessarily essential to the solution of the disclosure.
FIG. 1 is a diagram illustrating an example of the configuration of a control system according to some embodiments of the present disclosure. In the configuration illustrated in FIG. 1, the control system 1 includes a learning device 100, a control device 200, and a control target 910.

The control system 1 is a system that learns control of a control target 910 and controls the control target 910 based on the learning results.
The learning device 100 learns control over the control target 910. In particular, the learning device 100 learns control over the control target 910 by reinforcement learning using a simulation.

The reinforcement learning referred to here is machine learning that learns a policy, which is the behavioral rule of an agent that performs actions in a certain environment, based on a reward representing an evaluation of those actions.
The state of the environment is also referred to simply as the state. The environment here may include the agent. Therefore, the state here may include the state of the agent.

For each step in reinforcement learning, the learning device 100 decides an action based on the policy under the state at that time, simulates the decided action, and calculates the next state, which is the state at the next step. The learning device 100 also calculates a reward value (a value of the reward) based on the obtained next state, and updates the policy based on the calculated reward value. Updating the policy here can be regarded as generating a policy. In other words, the learning device 100 can be regarded as generating a policy based on a past policy.

Hereinafter, a step in reinforcement learning is also referred to as a time step, or simply a step. Below, time is expressed in time steps.
The reinforcement learning method used by the learning device 100 is not limited to a specific type of method. For example, the learning device 100 may learn control of the control target 910 based on a known reinforcement learning method such as Q-learning or SARSA.
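As an illustrative sketch only (the disclosure is not limited to any particular method), the per-step cycle described above (deciding an action from the policy, simulating it, obtaining the next state and a reward value, and updating the policy) can be written as a tabular Q-learning step. The environment interface `env.step`, the state encoding, and all hyperparameters here are hypothetical:

```python
import random
from collections import defaultdict

def q_learning_step(q, state, env, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One reinforcement-learning time step: decide an action from the current
    policy (Q-table), simulate it, and update the policy from the reward."""
    # Epsilon-greedy action decision based on the current Q-values.
    if random.random() < epsilon:
        action = random.randrange(n_actions)
    else:
        action = max(range(n_actions), key=lambda a: q[(state, a)])
    # The simulator returns the next state, an instantaneous reward value,
    # and whether the episode has ended.
    next_state, reward, done = env.step(state, action)
    # Q-learning update: move Q(s, a) toward reward + discounted future value.
    best_next = max(q[(next_state, a)] for a in range(n_actions))
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return next_state, done

# A Q-table with a default value of 0.0 for unseen (state, action) pairs.
q_table = defaultdict(float)
```

Repeating this step until an episode-end condition holds corresponds to the per-episode loop described in the text.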

For example, for each episode, the learning device 100 repeats the processing for each step until the episode ends. An episode here is a unit of time in reinforcement learning: the time interval from when the agent starts a series of actions until it ends them. For example, the learning device 100 repeats the processing for each step in an episode until a condition predetermined as the end condition of the episode is satisfied.

Furthermore, when the learning device 100 determines that a condition predetermined for interrupting the simulation is satisfied, the learning device 100 interrupts the simulation even in the middle of an episode. The condition for interrupting the simulation is also referred to as an interruption condition. Interrupting the simulation can also be regarded as interrupting the episode being executed.
This allows the learning device 100 to automatically interrupt the execution of an episode when it falls into a state in which no policy update can be expected even if the episode is continued, or into a state in which policy updates are expected to progress only slowly.
Updating the policy can be regarded as improving the policy. The policy being updated can be regarded as progress of the reinforcement learning.
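As a hypothetical sketch of such an interruption condition, a check combining a deterioration threshold on the cumulative reward value with a run of consecutively deteriorating time steps might look as follows; the threshold and patience values are invented for illustration and are not part of the disclosure:

```python
def should_interrupt(cumulative_rewards, threshold_drop=5.0, patience=3):
    """Decide whether to interrupt the simulation based on the change in the
    cumulative reward value (the cumulative sum of instantaneous reward values).

    Interrupt when either:
      - the cumulative reward has deteriorated by more than `threshold_drop`
        from its best value so far, or
      - the evaluation has kept deteriorating for `patience` or more
        consecutive time steps.
    """
    if len(cumulative_rewards) < 2:
        return False
    best = max(cumulative_rewards)
    current = cumulative_rewards[-1]
    if best - current > threshold_drop:
        return True
    # Count consecutive deteriorating transitions at the end of the sequence.
    streak = 0
    for prev, cur in zip(cumulative_rewards[-patience - 1:], cumulative_rewards[-patience:]):
        streak = streak + 1 if cur < prev else 0
    return streak >= patience
```

A check of this kind would be evaluated once per time step, and a positive result would trigger the reset handling described below.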

The learning device 100 that has interrupted the execution of an episode may go back time steps within the episode and resume the execution of the episode. Alternatively, the learning device 100 that has interrupted the execution of an episode may start the execution of another episode.
According to the learning device 100, reinforcement learning can be performed more efficiently than in a case where the execution of an episode is continued even after falling into a state in which no policy update can be expected, or into a state in which policy updates are expected to progress only slowly.

During the execution of an episode, the learning device 100 also stores the state at each of a plurality of time steps. When the learning device 100 interrupts the execution of the episode, it selects one of the time steps whose state is stored, returns to the selected time step, and resumes the execution of the episode. Specifically, the learning device 100 resumes the processing for each time step from the state of the selected time step.

Here, consider a case where an episode is executed from the beginning every time. The beginning of the episode is a part for which learning has already progressed sufficiently, and it is conceivable that the policy update will not proceed any further there. Since the learning device 100 can select not only the beginning of the episode but also a time step in the middle of the episode as the point to go back to, reinforcement learning is expected to be performed more efficiently.
The time step to which the episode is traced back is also referred to as the reset destination. A candidate for the reset destination is also referred to as a reset destination candidate. A reset destination candidate is, for example, an executed time step whose state is stored as a time step from which the simulation can be resumed, within the episode that was being executed until the simulation was interrupted.
Interrupting the execution of the simulation and changing the state in the simulation to the state of the reset destination is also referred to as resetting the simulation.
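The snapshot storage and reset mechanism described above might be sketched as follows. This is an assumption-laden illustration, not the disclosed implementation; in particular, weighting earlier time steps more heavily is only one hypothetical choice of the probability distribution over reset destination candidates:

```python
import copy
import random

class EpisodeSnapshots:
    """Store simulation states at executed time steps so that an interrupted
    episode can be resumed from a selected reset destination candidate."""

    def __init__(self):
        self._snapshots = {}  # time step -> deep-copied simulation state

    def save(self, t, state):
        self._snapshots[t] = copy.deepcopy(state)

    def candidates(self):
        # Executed time steps whose states are stored are the reset
        # destination candidates.
        return sorted(self._snapshots)

    def choose_reset_destination(self, t_interrupt, rng=random):
        """Select one candidate; here, steps further from the interruption
        receive larger weights, as one example of a non-uniform distribution."""
        cands = self.candidates()
        weights = [t_interrupt - t + 1 for t in cands]
        return rng.choices(cands, weights=weights, k=1)[0]

    def restore(self, t):
        # Resetting the simulation: return the stored state for the chosen step.
        return copy.deepcopy(self._snapshots[t])
```

A uniform distribution (equal weights) would correspond to the simplest selection rule mentioned in the appendices.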

The control device 200 controls the control target 910 based on the policy obtained by learning using the learning device 100.
The control target 910 is not limited to a specific one, and can be any of various objects for which control can be learned using reinforcement learning. For example, the control target 910 may be equipment such as a factory or a power generation plant, a system such as a manufacturing line in a factory, or a standalone device. Alternatively, the control target 910 may be a moving object such as an automobile, a railroad vehicle, an airplane, a ship, or a self-propelled mobile robot, or a transportation system such as a railroad or air traffic control system.
The control target 910 may be configured as a part of the control system 1, or may be configured externally to the control system 1.

During learning, only the learning device 100 is needed; the control device 200 and the control target 910 may be absent. During execution of control, only the control device 200 and the control target 910 are needed; the learning device 100 may be absent.
All of the learning device 100, the control device 200, and the control target 910, or a combination of two of them, may be configured integrally. For example, the learning device 100 and the control device 200 may be configured as a single device. Also, the control device 200 and the control target 910 may be configured as a single device.

FIG. 2 is a diagram showing an example of the configuration of the learning device 100. In the configuration shown in FIG. 2, the learning device 100 includes a communication unit 110, a display unit 120, an operation input unit 130, a storage unit 180, and a processing unit 190. The processing unit 190 includes an action decision unit 191, a simulation unit 192, a policy generation unit 193, a reset decision unit 194, and a resume state decision unit 195.

The communication unit 110 communicates with other devices. For example, the communication unit 110 may receive various information for performing a simulation from other devices. The communication unit 110 may also transmit the policy obtained by learning to the control device 200.

The display unit 120 has a display screen, such as a liquid crystal panel or an LED (Light Emitting Diode) panel, and displays various images. For example, the display unit 120 may display information related to the learning performed by the learning device 100, such as the progress of that learning.

The operation input unit 130 includes input devices such as a keyboard and a mouse, and receives user operations. For example, the operation input unit 130 may receive a user operation instructing the learning device 100 to start learning control over the control target 910.
The storage unit 180 stores various data. For example, it may store snapshots of states in the simulation. The storage unit 180 is configured using a storage device included in the learning device 100.

 The processing unit 190 controls each unit of the learning device 100 to perform various kinds of processing. The functions of the processing unit 190 are implemented, for example, by a CPU (Central Processing Unit) included in the learning device 100 reading a program from the storage unit 180 and executing it.
 The action determination unit 191 determines an action of the control target 910 in reinforcement learning. The action determination unit 191 determines the action of the control target 910 based on a policy.
 The action determination unit 191 corresponds to an example of an action determination means.

 The simulation unit 192 performs a simulation of the environment. In particular, the simulation unit 192 simulates the action determined by the action determination unit 191 and calculates the next state, that is, the state after the action. Here, the control target 910 in the simulation can be regarded as an agent or a part of an agent, and the control target 910 may be included in the environment to be simulated. The state in the simulation may include the state of the control target 910.
 The simulation unit 192 corresponds to an example of a simulation means.
 The simulation unit 192 may be configured as a part of the learning device 100, or may be configured externally to the learning device 100.

 The policy generation unit 193 updates the policy based on a reward value. As described above, the reward value is a value indicating an evaluation of an action, and the policy is a decision rule for actions. As described above, updating the policy here can be regarded as generating the policy.
 The policy generation unit 193 may update the policy based on an instantaneous reward value, or based on a cumulative reward value. The instantaneous reward value is a value calculated at each step based on the state, the action, and the next state, or some of these, and indicates an evaluation of the action at that step. The cumulative reward value is a value calculated by accumulating instantaneous reward values over multiple steps. When the cumulative reward value is calculated, each instantaneous reward value may be multiplied by a coefficient value such as a forgetting coefficient.
 Alternatively, the policy generation unit 193 may update the policy based on a value function value (also referred to simply as a value). The value function value is a predicted value of the cumulative reward value, such as the expected value of the cumulative reward value at the end of an episode.

 The instantaneous reward value, the cumulative reward value, and the value function value all correspond to examples of reward values.
 The learning device 100 may use a reward value for which a larger value indicates a better evaluation, or a reward value for which a smaller value indicates a better evaluation. In the following, a case where the learning device 100 uses a reward value for which a larger value indicates a better evaluation is described as an example.
 The combination of the action determination unit 191 and the policy generation unit 193 is also referred to as an agent processing unit.

 When the reset determination unit 194 determines that an interruption condition is satisfied, it interrupts the simulation by the simulation unit 192. Specifically, the reset determination unit 194 causes the simulation unit 192 to suspend the simulation it is performing. As described above, the interruption condition is a condition determined in advance as a condition for interrupting the simulation of the behavior of the control target 910.
 The reset determination unit 194 corresponds to an example of a reset determination means.

 As described above for the learning device 100, the reset determination unit 194 can automatically interrupt the execution of an episode when the episode falls into a state in which no policy update can be expected even if the episode is continued, or into a state in which policy updates are expected to progress only slowly.

 The reset determination unit 194 that has interrupted the execution of an episode may go back in time steps within that episode and cause the simulation unit 192 to restart the simulation from the state at the time step it went back to. Alternatively, the reset determination unit 194 that has interrupted the execution of an episode may select another episode and cause the simulation unit 192 to perform a simulation of the selected episode.
 As a result, the learning device 100 can perform reinforcement learning more efficiently than when execution of an episode is continued even after falling into a state in which no policy update can be expected, or a state in which policy updates are expected to progress only slowly.

 The restart state determination unit 195 determines a reset destination when the simulation by the simulation unit 192 is interrupted. The simulation unit 192 restarts the simulation from the reset destination state determined by the restart state determination unit 195.
 The restart state determination unit 195 corresponds to an example of a restart state determination means.

 As described above, policy updates do not necessarily progress at the beginning of an episode. As described above for the learning device 100, the restart state determination unit 195 can select not only the beginning of an episode but also a time step in the middle of the episode as the destination to go back to, which is expected to enable the learning device 100 to perform reinforcement learning more efficiently.

 Note that, when the reset determination unit 194 interrupts the execution of an episode, it may cause the simulation unit 192 to restart the simulation from a predetermined state, such as the initial state of that episode.
 In this case, the learning device 100 may be configured without the restart state determination unit 195.

 The reset determination unit 194 may decide whether or not to interrupt the simulation by the simulation unit 192 based on a change in the cumulative reward value.
 FIG. 3 is a diagram showing an example of the instantaneous reward values from which a cumulative reward value is calculated. The horizontal axis of the graph in FIG. 3 indicates the time step, and the vertical axis indicates the instantaneous reward value.
 In the example of FIG. 3, the instantaneous reward value can take positive and negative values. The larger the instantaneous reward value (for example, the larger its magnitude as a positive value), the better the evaluation it indicates. The smaller the instantaneous reward value (for example, the larger its magnitude as a negative value), the worse the evaluation it indicates.
 In the example of FIG. 3, the instantaneous reward value is positive at time step t11, and from the time step after t11 onward, time steps in which the instantaneous reward value is negative continue.

 FIG. 4 is a diagram showing an example of the cumulative reward value. The horizontal axis of the graph in FIG. 4 indicates the time step, and the vertical axis indicates the cumulative reward value.
 FIG. 4 shows the cumulative reward value calculated by accumulating the instantaneous reward values shown in FIG. 3 from the start of the episode. The cumulative reward value reaches a local maximum at time step t11, and continues to decrease from the time step after t11 onward.

 In a state where the cumulative reward value is small, such as the state at time step t12, it is likely that an event has occurred that causes the evaluation indicated by the reward value to be poor. For example, when the control target 910 is a railway, the intervals between trains may have become too short, disrupting the train schedule.

 When an event that causes the evaluation indicated by the reward value to be poor has occurred, continuing to execute the episode as it is will not improve that evaluation, and going back in time steps and redoing the execution of the episode is considered more likely to advance the policy update.
 For example, when the control target 910 is a railway and the intervals between trains have become too short, disrupting the schedule, the evaluation indicated by the reward value is unlikely to improve until the congestion between trains is resolved. Because the evaluation indicated by the reward value remains poor, the policy update by the policy generation unit 193 is unlikely to progress.

 In this case, rather than continuing to execute the episode as it is, the policy update is expected to progress more if execution of the episode is redone from a time step before the event causing the poor evaluation occurred, such as a time step before the intervals between trains became too short.
 In the example of FIG. 4, in the time steps from the time step after t11 onward, where the cumulative reward value continues to decrease, it is likely that an event causing the poor evaluation has already occurred. In this case, the policy update is expected to progress more if execution of the episode is redone from time step t11 or an earlier time step.

 Therefore, the reset determination unit 194 may decide whether or not to interrupt the simulation by the simulation unit 192 based on the change in the cumulative reward value.
 For example, when the reset determination unit 194 determines that the evaluation indicated by the cumulative reward value has deteriorated beyond a predetermined threshold, it may decide to interrupt the simulation by the simulation unit 192. In the example of FIG. 4, the reset determination unit 194 may compare the magnitude of the decrease from the local maximum of the cumulative reward value at time step t11 with a threshold d11, and decide to interrupt the simulation by the simulation unit 192 at, or after the end of, time step t12, at which the magnitude of the decrease exceeds the threshold d11.
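 The threshold check described here can be sketched as follows. This is not code from the patent; the function name is illustrative, and for simplicity the decrease is measured from the maximum of the cumulative reward observed so far (one of the starting points discussed below), rather than from the most recent local maximum.

```python
# Illustrative sketch: decide to interrupt the simulation when the
# cumulative reward has dropped from its maximum so far by more than
# a threshold d11.
def should_interrupt(cumulative_rewards, d11):
    if not cumulative_rewards:
        return False
    peak = max(cumulative_rewards)          # best evaluation so far
    drop = peak - cumulative_rewards[-1]    # deterioration since that peak
    return drop > d11
```

 Such a check would be evaluated once per time step, after the cumulative reward for that step has been computed.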

 In this case, the reset determination unit 194 may update the threshold so that the width of the deterioration in evaluation indicated by the threshold of the cumulative reward value increases as the simulation by the simulation unit 192 progresses. In the example of FIG. 4, the reset determination unit 194 may update the value of the threshold d11 so that it increases as the time steps progress.

 The starting point from which the reset determination unit 194 calculates the amount of decrease in the cumulative reward value is not limited to a local maximum of the cumulative reward value. For example, the reset determination unit 194 may calculate the amount of decrease from a cumulative reward value of 0 and compare it with the threshold. Alternatively, the reset determination unit 194 may calculate the amount of decrease from the maximum value of the cumulative reward value and compare it with the threshold.

 Here, in the early stage of an episode or the early stage of reinforcement learning, there are likely to be many cases in which the reward value does not become large because learning has not yet progressed. In this case, by the reset determination unit 194 setting the threshold to a relatively small value so that execution of the episode is interrupted earlier, the action determination unit 191 comes to try various actions, and it is expected that an action (or series of actions) yielding a large reward value can be found at a relatively early point.

 Alternatively, the reset determination unit 194 may decide, based on the increase and decrease of the cumulative reward value, to interrupt the simulation by the simulation unit 192 when the deterioration of the evaluation indicated by the cumulative reward value has continued for a predetermined number of time steps or more. In the example of FIG. 4, when the number of time steps for interrupting the simulation by the simulation unit 192 is set to 7, the reset determination unit 194 may decide to interrupt the simulation by the simulation unit 192 at, or after the end of, time step t12, at which the cumulative reward value has decreased seven times in succession from time step t11, at which the cumulative reward value is at a local maximum.
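 The consecutive-decrease criterion can be sketched as follows. The patent gives no code; the function names and the default streak length of 7 (taken from the FIG. 4 example) are illustrative.

```python
# Illustrative sketch: count how many consecutive time steps the
# cumulative reward has been decreasing at the end of the sequence,
# and interrupt once the streak reaches a predetermined length.
def consecutive_decreases(cumulative_rewards):
    count = 0
    for prev, cur in zip(cumulative_rewards, cumulative_rewards[1:]):
        count = count + 1 if cur < prev else 0  # streak resets on any non-decrease
    return count

def should_interrupt_by_streak(cumulative_rewards, max_streak=7):
    return consecutive_decreases(cumulative_rewards) >= max_streak
```

 The streak counter resets to zero whenever the cumulative reward stops decreasing, so only an uninterrupted run of deteriorating steps triggers the interruption.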

 The reset determination unit 194 may decide whether or not to interrupt the simulation by the simulation unit 192 based on a combination of the amount of decrease in the cumulative reward value and the number of consecutive time steps in which the cumulative reward value has decreased. For example, the reset determination unit 194 may decide to interrupt the simulation by the simulation unit 192 both when the magnitude of the decrease in the cumulative reward value exceeds a predetermined threshold and when the number of consecutive time steps in which the cumulative reward value has decreased reaches or exceeds a predetermined number.

 The reset determination unit 194 may decide whether or not to interrupt the simulation by the simulation unit 192 based on the value function value, in addition to or instead of the cumulative reward value. The value function value is a predicted value of the cumulative reward value, and a positive correlation is considered to exist between the value function value and the cumulative reward value. By the reset determination unit 194 deciding whether or not to interrupt the simulation based on the value function value, the simulation by the simulation unit 192 comes to be interrupted when an event causing the evaluation indicated by the reward value to be poor has occurred, as in the case based on the cumulative reward value, and the policy update is expected to progress.

 As in the case of the cumulative reward value described above, the reset determination unit 194 may interrupt the simulation by the simulation unit 192 when it determines, based on the increase and decrease of the value function value, that the evaluation indicated by the value function value has deteriorated beyond a predetermined threshold.
 In this case, as in the case of the cumulative reward value described above, the reset determination unit 194 may update the threshold so that the width of the deterioration in evaluation indicated by the threshold of the value function value increases as the simulation by the simulation unit 192 progresses.

 Also, as in the case of the cumulative reward value described above, the reset determination unit 194 may decide, based on the increase and decrease of the value function value, to interrupt the simulation by the simulation unit 192 when the deterioration of the evaluation indicated by the value function value has continued for a predetermined number of time steps or more.
 As in the case of the cumulative reward value described above, the reset determination unit 194 may decide whether or not to interrupt the simulation by the simulation unit 192 based on a combination of the amount of decrease in the value function value and the number of consecutive time steps in which the value function value has decreased.

 The reset determination unit 194 may interrupt the simulation by the simulation unit 192 when it determines that the current state (the state at the current time step) is similar, to at least a certain degree, to any of one or more states set in advance as states requiring interruption.
 For example, when the control target 910 is a railway, the storage unit 180 stores, as states requiring interruption, a plurality of patterns of states in which the intervals between trains have become too short and the schedule has been disrupted. The reset determination unit 194 then compares the current state with each of the states requiring interruption, and determines whether the current state is similar to any of them. When determining that the current state is similar to any of the states requiring interruption, the reset determination unit 194 decides to interrupt the simulation by the simulation unit 192.

 The method by which the reset determination unit 194 determines whether the current state is similar to a state requiring interruption is not limited to any specific method.
 For example, the reset determination unit 194 may compute a feature vector for each of the two states and calculate a vector similarity such as the cosine similarity. The reset determination unit 194 may then compare the calculated similarity with a threshold and determine that the two states are similar when the similarity is equal to or greater than the threshold. Alternatively, a machine learning model that determines whether two states are similar may be prepared, and the reset determination unit 194 may use this machine learning model to determine whether the current state is similar to a state requiring interruption.
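 The cosine-similarity comparison described here can be sketched as follows. This is not code from the patent; the function names, the feature-vector representation, and the example threshold of 0.9 are all illustrative assumptions.

```python
import math

# Illustrative sketch: compare the current state's feature vector with
# each stored "interruption required" state via cosine similarity, and
# report a match when the similarity reaches a threshold.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def matches_interrupt_state(current, interrupt_states, threshold=0.9):
    return any(cosine_similarity(current, s) >= threshold
               for s in interrupt_states)
```

 In practice the feature vectors would be extracted from the simulation state (for the railway example, quantities such as inter-train headways), which is outside the scope of this sketch.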

 When the learning device 100 collects states requiring interruption, it may present states to the user and accept the user's designation of states requiring interruption.
 For example, the reset determination unit 194 may cause the display unit 120 to display some state, such as a current or past state in the simulation. The reset determination unit 194 may then receive, via the operation input unit 130, a user operation instructing whether or not to register the displayed state as a state requiring interruption.

 In this case, the combination of the reset determination unit 194 and the display unit 120 corresponds to an example of a state presentation means, and the combination of the reset determination unit 194 and the operation input unit 130 corresponds to an example of a state designation reception means.
 When the control target 910 is a railway, the display unit 120 may display the current train operation status in the simulation in the form of a timetable diagram. Then, when the user judges that the displayed operation status is one requiring interruption, the user may perform a user operation indicating this on the operation input unit 130, and the reset determination unit 194 may detect that the user operation has been performed.

 The learning device 100 corresponds to an example of an input/output device in that it includes the reset determination unit 194, the display unit 120, and the operation input unit 130. Alternatively, the input/output device may be configured as a device separate from the learning device 100. For example, a terminal device of the learning device 100 may have a function of displaying a state in accordance with an instruction from the learning device 100, and a function of receiving a user operation designating a state requiring interruption and notifying the learning device 100 of it.

 The restart state determination unit 195 may select, from among the reset destination candidates in the episode that was being executed until the simulation was interrupted, a reset destination candidate corresponding to the time step, among those at which the change in evaluation indicated by the increase and decrease of the cumulative reward value turned from improvement to deterioration, that has the fewest time steps between it and the time of the interruption, or to an earlier time step. In other words, the restart state determination unit 195 may select a reset destination candidate corresponding to the time step at which the evaluation most recently reached a local maximum before the interruption of the simulation, or to an earlier time step.

 In the example of FIG. 4, when the reset determination unit 194 interrupts the simulation by the simulation unit 192 at time step t12, time step t11 corresponds to the time step at which the cumulative reward value most recently reached a local maximum before the interruption (time step t12). The restart state determination unit 195 may select a reset destination candidate corresponding to time step t11 or an earlier time step.

 In the time steps after the time step at which the evaluation most recently reached a local maximum before the interruption of the simulation, the cumulative reward value is continuously decreasing, so an event causing the evaluation indicated by the reward value to be poor is likely to have occurred. By the restart state determination unit 195 selecting a reset destination candidate corresponding to the time step of that most recent local maximum or an earlier time step, the probability of selecting a time step preceding the occurrence of the event causing the poor evaluation is expected to increase.

 Alternatively, the restart state determination unit 195 may select any one reset destination candidate from among all the reset destination candidates in the episode that was being executed until the simulation was interrupted.
 As a result, when a decrease in the cumulative reward value does not necessarily indicate the occurrence of an event causing the evaluation indicated by the reward value to be poor, the restart state determination unit 195 may be able to select a later reset destination candidate (one closer to the time of the interruption).

 The restart state determination unit 195 may select any one of a plurality of reset destination candidates in accordance with a probability distribution set over those candidates.
 Here, when an event causing the evaluation indicated by the reward value to be poor has occurred, from the viewpoint of going back to a time step before the event occurred, it is conceivable to go back to as early a time step as possible in the episode (a time step close to the beginning of the episode). On the other hand, from the viewpoint of reducing repeated learning of the early part of the episode, it is conceivable to go back to as late a time step as possible among the time steps already executed in the episode. Thus, there is a trade-off between making reinforcement learning more efficient by avoiding states in which an event causing a poor evaluation has occurred, and making it more efficient by reducing repeated learning of the early part of the episode.

 Therefore, the restart state determination unit 195 selects one of the plurality of reset destination candidates probabilistically. This avoids restarting execution of the episode from near its beginning every time execution is interrupted. Also, when the event causing the poor evaluation has already occurred at the selected reset destination candidate, it is expected that, by repeating the interruption of the episode one or more additional times, a reset destination candidate preceding the occurrence of that event can be selected.

 The restart state determination unit 195 may select one of the plurality of reset destination candidates in accordance with a uniform probability distribution.
 This allows the restart state determination unit 195 to select a reset destination candidate even for a control target 910 for which there is no guide as to how many time steps must be gone back to reach a state preceding the occurrence of the event causing the poor evaluation.

FIG. 5 is a diagram showing a first example of a probability distribution set for reset destination candidates. The horizontal axis of the graph in FIG. 5 represents time steps, and the vertical axis represents probability.
In the example of FIG. 5, the reset determination unit 194 has suspended the simulation by the simulation unit 192 at a time step after time step t25, and the five time steps t21 to t25 are the reset destination candidates. Of these five candidates, the restart state determination unit 195 takes the three time steps t21 to t23 as the candidates eligible for selection and sets a uniform probability distribution over them. Following this distribution, the restart state determination unit 195 selects each of the three candidates with a probability of one third.

Alternatively, the restart state determination unit 195 may set a probability distribution over the reset destination candidates such that the greater the deterioration in evaluation at the time the simulation is interrupted, as indicated by the change in the cumulative reward value, the more likely a candidate with a larger number of time steps between it and the interruption is to be selected, and may then select one candidate according to that distribution.

FIG. 6 is a diagram showing a second example of a probability distribution set for reset destination candidates. The horizontal axis of the graph in FIG. 6 represents time steps, and the vertical axis represents probability.
In the example of FIG. 6, the reset determination unit 194 has suspended the simulation by the simulation unit 192 at a time step after time step t25, and the five time steps t21 to t25 are the reset destination candidates. Of these five candidates, the restart state determination unit 195 takes the three time steps t21 to t23 as the candidates eligible for selection and sets a probability distribution over them.

FIG. 6 shows an example in which the decrease in the cumulative reward value at the time the simulation is interrupted is relatively small; a higher probability is set for candidates closer to the interruption (that is, candidates with fewer time steps between them and the interruption).
The restart state determination unit 195 selects one of the three candidates according to the set distribution, making candidates close to the interruption relatively likely to be chosen.

FIG. 7 is a diagram showing a third example of a probability distribution set for reset destination candidates. The horizontal axis of the graph in FIG. 7 represents time steps, and the vertical axis represents probability.
In the example of FIG. 7, the reset determination unit 194 has suspended the simulation by the simulation unit 192 at a time step after time step t25, and the five time steps t21 to t25 are the reset destination candidates. Of these five candidates, the restart state determination unit 195 takes the three time steps t21 to t23 as the candidates eligible for selection and sets a probability distribution over them.

FIG. 7 shows an example in which the decrease in the cumulative reward value at the time the simulation is interrupted is relatively large. In the example of FIG. 7, time step t21, a reset destination candidate relatively far from the interruption, is assigned a higher probability than in the case of FIG. 6, while time step t23, a candidate relatively close to the interruption, is assigned a lower probability than in the case of FIG. 6.
The restart state determination unit 195 selects one of the three candidates according to the set distribution, making candidates far from the interruption (that is, candidates with more time steps between them and the interruption) relatively likely to be chosen.

Here, a situation in which the cumulative reward value drops sharply is one that is particularly desirable to avoid. To raise the likelihood of avoiding such a situation, it is conceivable to trace back a large number of time steps from the interruption of the simulation.
By contrast, a situation in which the cumulative reward value decreases only gradually is less important to avoid than one in which it drops sharply. In that case, to reduce repeated learning at the beginning of the episode, it is conceivable to trace back relatively few time steps from the interruption.
By having the restart state determination unit 195 vary which reset destination candidates are likely to be selected according to the magnitude of the change in the cumulative reward value, the learning device 100 is expected to perform reinforcement learning efficiently.
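One way to realize such a drop-dependent distribution is sketched below; it is an assumption for illustration, not the disclosed implementation. Candidate weights vary exponentially with the number of time steps back, scaled by how far the drop in the cumulative reward value exceeds an assumed reference level `reference_drop`: small drops favor candidates near the interruption, large drops favor candidates further back.

```python
import math

def reset_candidate_distribution(num_candidates, drop,
                                 reference_drop=1.0, scale=1.0):
    """Return selection probabilities for reset destination candidates
    indexed 0 (furthest back in the episode) to num_candidates - 1
    (closest to the interruption).  Drops smaller than reference_drop
    favour candidates near the interruption; larger drops favour
    candidates further back."""
    weights = [math.exp(scale * (drop - reference_drop)
                        * (num_candidates - 1 - i))
               for i in range(num_candidates)]
    total = sum(weights)
    return [w / total for w in weights]
```

The exponential form is one convenient choice; any family of distributions that shifts probability mass toward earlier time steps as the drop grows would serve the same purpose.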

FIG. 8 is a diagram showing an example of data input and output in the learning device 100.
In the example of FIG. 8, the action determination unit 191 determines the action of the control target 910 based on the policy, and outputs the determined action to the simulation unit 192.
The simulation unit 192 simulates the action determined by the action determination unit 191 to calculate a state (the next state), and outputs the calculated state to the policy generation unit 193.

The policy generation unit 193 calculates a reward value based on the state acquired from the simulation unit 192, and updates the policy based on the calculated reward value. Depending on the reward value, the policy generation unit 193 may leave the policy unchanged.
The policy generation unit 193 also outputs the calculated reward value to the reset determination unit 194 and the restart state determination unit 195.

The reset determination unit 194 determines, based on the reward value, whether to suspend the simulation by the simulation unit 192. If the reset determination unit 194 decides to suspend the simulation, the restart state determination unit 195 determines the reset destination based on the reward value. The reset determination unit 194 and the restart state determination unit 195 then output an instruction to suspend the simulation and the reset destination to the simulation unit 192 and the policy generation unit 193.

Here, the reward values used by the policy generation unit 193, the reset determination unit 194, and the restart state determination unit 195 are not limited to any particular kind. Each may be an instantaneous reward value, a cumulative reward value, a value function value, another reward value, or a combination of these.

The reward values used by the policy generation unit 193, the reset determination unit 194, and the restart state determination unit 195 may also be the same as or different from one another. For example, the policy generation unit 193 may output an instantaneous reward value to the reset determination unit 194, and the reset determination unit 194 may calculate a cumulative reward value based on the instantaneous reward value.

When it receives a suspension instruction, the simulation unit 192 suspends the simulation and returns the state of the simulation to the designated reset destination.
When it receives a suspension instruction, the policy generation unit 193 likewise returns the policy, the reward value, or both to their values at the designated reset destination. Alternatively, the policy generation unit 193 may leave both the policy and the reward value unchanged when the simulation is suspended; in that case, the reset determination unit 194 and the restart state determination unit 195 need not output the suspension instruction and the reset destination to the policy generation unit 193.
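The rollback described above presupposes that a snapshot is kept for each executed time step. A minimal sketch of such bookkeeping, with hypothetical names and deep copies as the (illustrative) snapshot mechanism, might look like this:

```python
import copy

class EpisodeHistory:
    """Stores a snapshot per executed time step so the simulation state
    (and the associated cumulative reward value) can be returned to a
    chosen reset destination."""

    def __init__(self):
        self._snapshots = []

    def record(self, state, cumulative_reward):
        """Snapshot the state and bookkeeping for one time step."""
        self._snapshots.append((copy.deepcopy(state), cumulative_reward))

    def reset_to(self, step):
        """Return the snapshot at `step` and discard later ones."""
        state, cumulative_reward = self._snapshots[step]
        del self._snapshots[step + 1:]
        return copy.deepcopy(state), cumulative_reward
```

In practice the simulation unit might instead restore state through a simulator-specific checkpoint API; deep copies are used here only to keep the sketch self-contained.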

FIG. 9 is a diagram showing an example of the procedure of processing performed by the learning device 100.
In the processing shown in FIG. 9, the simulation unit 192 sets up the simulation (step S101). At the start of an episode, the simulation unit 192 sets the state specified for that episode.
Next, the processing unit 190 executes one step of reinforcement learning (step S102).
The reset determination unit 194 then determines whether the suspension condition is met (step S103).

If the reset determination unit 194 determines that the suspension condition is met (step S103: YES), the restart state determination unit 195 determines the reset destination (step S111).
After step S111, the processing returns to step S101. In this case, in step S101, the simulation unit 192 sets the state in the simulation to the state of the reset destination.

If, on the other hand, the reset determination unit 194 determines in step S103 that the suspension condition is not met (step S103: NO), the processing unit 190 determines whether the episode end condition is met (step S121).
If the processing unit 190 determines that the episode end condition is not met (step S121: NO), the processing returns to step S102.

If the processing unit 190 determines in step S121 that the episode end condition is met (step S121: YES), it determines whether the end condition of the reinforcement learning is met (step S131).
If the processing unit 190 determines that the end condition of the reinforcement learning is not met (step S131: NO), it selects the next episode (step S141).
After step S141, the processing returns to step S101. In this case, the simulation unit 192 sets the state specified for the episode selected by the processing unit 190.
If, on the other hand, the processing unit 190 determines in step S131 that the end condition of the reinforcement learning is met (step S131: YES), the learning device 100 ends the processing of FIG. 9.
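The control flow of FIG. 9 can be sketched as a pair of nested loops. The callback functions below are placeholders standing in for the corresponding units of the learning device; they are assumptions for illustration, not part of the disclosure.

```python
def run_training(setup, learn_step, interrupted, episode_done,
                 training_done, choose_reset, next_episode):
    """Loop structure of FIG. 9 (steps S101, S102, S103, S111,
    S121, S131, S141 of the learning device 100)."""
    episode = next_episode(None)               # select the first episode
    reset_target = None
    while True:
        setup(episode, reset_target)           # S101: set up simulation
        reset_target = None
        while True:
            learn_step()                       # S102: one RL step
            if interrupted():                  # S103: suspension condition
                reset_target = choose_reset()  # S111: decide reset target
                break                          # back to S101
            if episode_done():                 # S121: episode end
                if training_done():            # S131: training end
                    return
                episode = next_episode(episode)  # S141: next episode
                break                          # back to S101
```

The helper callbacks would be implemented by the simulation unit 192, the processing unit 190, the reset determination unit 194, and the restart state determination unit 195 respectively.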

As described above, the policy generation unit 193 generates a policy, a decision rule for actions, based on a reward value, a value indicating an evaluation of an action of the control target 910, in a state reached by tracing back time steps within an episode representing a learning period. The action determination unit 191 determines the action of the control target based on the policy.

According to the learning device 100, tracing back time steps within an episode is expected to make reinforcement learning more efficient.
For example, the learning device 100 may be able to go back past the time step at which a factor causing the evaluation indicated by the reward value to deteriorate occurred, and thereby avoid that factor. The learning device 100 is thus expected to generate a policy that improves the evaluation over the entire episode more quickly.

The reset determination unit 194 suspends the simulation of the action of the control target 910 when it determines that a predetermined suspension condition is met. The policy generation unit 193 generates the policy based on a reward value calculated from the simulation of the action of the control target 910.
According to the learning device 100, reinforcement learning can be performed more efficiently than when episode execution is continued even after falling into a state in which no policy update can be expected, or in which policy updating is expected to progress only slowly.

The simulation unit 192 simulates the action of the control target 910 and calculates the state of the environment in which the control target acts.
According to the learning device 100, tracing back time steps in the simulation is expected to make reinforcement learning more efficient.

The reset determination unit 194 determines whether to suspend the simulation based on a change in the cumulative reward value, the cumulative value of the instantaneous reward values.
This allows the reset determination unit 194 to decide whether to suspend the simulation based not only on the state at a single time step but also on how the state changes. In this respect, the learning device 100 is expected to decide appropriately whether to suspend the simulation.

When the reset determination unit 194 determines, based on the change in the cumulative reward value, that the evaluation indicated by the cumulative reward value has deteriorated beyond a predetermined threshold value, it decides to suspend the simulation.
According to the learning device 100, whether to suspend the simulation can be decided by the simple process of comparing the cumulative reward value with a threshold value.
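Read as a drop of the cumulative reward value from its best value so far, one possible interpretation of the comparison above, the check can be sketched as:

```python
def should_suspend_by_threshold(cumulative_rewards, threshold):
    """Decide suspension by comparing the deterioration of the
    cumulative reward value (its drop from the maximum reached so far
    in the episode) with a predetermined threshold value."""
    drop = max(cumulative_rewards) - cumulative_rewards[-1]
    return drop > threshold
```

The function name and the drop-from-peak reading are illustrative assumptions; the disclosure only requires that the deterioration indicated by the change in the cumulative reward value be compared with a threshold.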

The reset determination unit 194 also updates the threshold value so that the extent of deterioration in evaluation indicated by the threshold value of the cumulative reward value increases as the simulation progresses.
As described above, in the early stages of an episode or of reinforcement learning, the reward value often does not become large because learning has not yet progressed. In such cases, by setting the threshold to a relatively small value and thereby hastening the suspension of episode execution, the reset determination unit 194 causes the action determination unit 191 to try a variety of actions, and an action (or a sequence of actions) that increases the reward value is expected to be found relatively early. In this respect, the learning device 100 is expected to perform reinforcement learning efficiently.

The reset determination unit 194 also decides to suspend the simulation when, based on the change in the cumulative reward value, the deterioration of the evaluation indicated by the cumulative reward value has continued for a predetermined number of time steps or more.
According to the learning device 100, whether to suspend the simulation can be decided by the simple process of counting the number of steps over which the cumulative reward value has been continuously decreasing.
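The step-counting check can be sketched as follows; only the run of decreasing steps ending at the latest time step matters for the suspension decision. The function name is an illustrative assumption.

```python
def should_suspend_by_streak(cumulative_rewards, max_decreasing_steps):
    """Decide suspension when the cumulative reward value has decreased
    for max_decreasing_steps or more consecutive time steps up to the
    latest time step."""
    streak = 0
    for prev, cur in zip(cumulative_rewards, cumulative_rewards[1:]):
        # extend the run on a decrease, otherwise restart the count
        streak = streak + 1 if cur < prev else 0
    return streak >= max_decreasing_steps
```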

The reset determination unit 194 may also determine whether to suspend the simulation based on a value function value, a predicted value of the cumulative reward value that is the cumulative value of the instantaneous reward values.
This allows the reset determination unit 194 to decide whether to suspend the simulation based not only on the state at a single time step but also on how the state changes. In this respect, the learning device 100 is expected to decide appropriately whether to suspend the simulation.

When the reset determination unit 194 determines, based on the change in the value function value, that the evaluation indicated by the value function value has deteriorated beyond a predetermined threshold value, it decides to suspend the simulation.
According to the learning device 100, whether to suspend the simulation can be decided by the simple process of comparing the value function value with a threshold value.

The reset determination unit 194 also updates the threshold value so that the extent of deterioration in evaluation indicated by the threshold value of the value function value increases as the simulation progresses.
As described above, in the early stages of an episode or of reinforcement learning, the reward value often does not become large because learning has not yet progressed. In such cases, by setting the threshold to a relatively small value and thereby hastening the suspension of episode execution, the reset determination unit 194 causes the action determination unit 191 to try a variety of actions, and an action (or a sequence of actions) that increases the reward value is expected to be found relatively early.
In this respect, the learning device 100 is expected to perform reinforcement learning efficiently.

The reset determination unit 194 also decides to suspend the simulation when, based on the change in the value function value, the deterioration of the evaluation indicated by the value function value has continued for a predetermined number of time steps or more.
According to the learning device 100, whether to suspend the simulation can be decided by the simple process of counting the number of steps over which the value function value has been continuously decreasing.

The reset determination unit 194 also decides to suspend the simulation when it determines that the current state of the environment in which the control target 910 acts, as calculated in the simulation, is similar, to at least a certain degree, to one or more states preset as states requiring suspension.
A designer of the learning device 100 need only prepare samples of states in which the simulation should be suspended, and need not design rules for determining whether suspension is required. In this respect, the learning device 100 places a small burden on the designer.
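A minimal sketch of such a similarity test follows, assuming states are numeric vectors and using Euclidean distance against a threshold as the (illustrative) similarity criterion; the disclosure does not fix a particular similarity measure.

```python
import math

def needs_suspension(state, suspension_samples, distance_threshold):
    """Suspend when the current state is within distance_threshold of
    any state registered in advance as requiring suspension."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return any(distance(state, s) <= distance_threshold
               for s in suspension_samples)
```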

The combination of the reset determination unit 194 and the display unit 120 presents states to the user, and the combination of the reset determination unit 194 and the operation input unit 130 accepts a user operation designating a state that requires suspension.
According to the learning device 100, samples of states requiring suspension of the simulation can be acquired through the user's designation. In this respect too, the burden on the designer of the learning device 100 is small. Furthermore, the user can have the user's own judgment as to whether the simulation needs to be suspended reflected in the learning device 100.

The restart state determination unit 195 determines the state in which the simulation is restarted.
As described above, policy updating does not necessarily progress at the beginning of an episode. Because the restart state determination unit 195 can select not only the beginning of an episode but also an intermediate time step as the point to return to, the learning device 100 is expected to perform reinforcement learning more efficiently.

The restart state determination unit 195 also selects one reset destination candidate based on a probability distribution set over the reset destination candidates, which are executed time steps whose states are stored as time steps from which the simulation can be restarted within the episode that was being executed until the simulation was suspended.

As described above, when an event causing the poor evaluation indicated by the reward value has occurred, returning to a time step before that event suggests going back to as early a time step in the episode as possible (a time step close to the beginning of the episode), while reducing repeated learning at the beginning of the episode suggests returning to as late a time step as possible among the executed time steps. There is thus a trade-off between making reinforcement learning efficient by avoiding the state in which that event has occurred and making it efficient by reducing repeated learning at the beginning of the episode.

By having the restart state determination unit 195 probabilistically select one of the multiple reset destination candidates, restarting the episode from near its beginning every time its execution is interrupted can be avoided. Furthermore, if the event causing the poor evaluation indicated by the reward value has already occurred at the selected reset destination candidate, interrupting the episode one or more additional times is expected to allow a candidate preceding that event to be selected.

The restart state determination unit 195 also selects one of the multiple reset destination candidates according to a uniform probability distribution.
According to the learning device 100, a reset destination candidate can thus be selected even for a control target 910 for which there is no known guide to how many time steps must be traced back to reach a state preceding the event that causes the poor evaluation indicated by the reward value.

The restart state determination unit 195 also sets a probability distribution over the multiple reset destination candidates such that the greater the deterioration in evaluation at the time the simulation is interrupted, as indicated by the change in the cumulative reward value, the more likely a candidate with a larger number of time steps between it and the interruption is to be selected, and selects one candidate according to that distribution.

As described above, a situation in which the cumulative reward value drops sharply is one that is particularly desirable to avoid, and to raise the likelihood of avoiding it, it is conceivable to trace back a large number of time steps from the interruption of the simulation.
By contrast, a situation in which the cumulative reward value decreases only gradually is less important to avoid; in that case, to reduce repeated learning at the beginning of the episode, it is conceivable to trace back relatively few time steps from the interruption.
By having the restart state determination unit 195 vary which reset destination candidates are likely to be selected according to the magnitude of the change in the cumulative reward value, the learning device 100 is expected to perform reinforcement learning efficiently.

The restart state determination unit 195 also selects, from among the reset destination candidates, the candidate corresponding to the time step at which the change in evaluation indicated by the cumulative reward value turned from improvement to deterioration and which has the fewest time steps between it and the interruption of the simulation, or a candidate corresponding to an earlier time step.

 シミュレーションの中断時に直近で評価が極大に良い評価となっている時間ステップよりも後の時間ステップでは、累積報酬値が継続的に減少しており、報酬値が示す評価が悪い評価となる要因の事象が発生していることが考えられる。再開状態決定部195が、シミュレーションの中断時に直近で評価が極大に良い評価となっている時間ステップまたはそれよりも前の時間ステップに相当するリセット先候補を選択することで、報酬値が示す評価が悪い評価となる要因の事象が発生する前の時間ステップを選択する可能性が高まることが期待される。 In the time steps after the most recent time step, as of the interruption of the simulation, at which the evaluation reached a local maximum, the cumulative reward value decreases continuously, so it is likely that an event has occurred that causes the evaluation indicated by the reward value to deteriorate. By having the restart state determination unit 195 select a reset destination candidate corresponding to that most recent locally maximal time step, or to an earlier time step, the possibility of selecting a time step before the occurrence of such an event is expected to increase.
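This selection rule can be sketched minimally, assuming the cumulative reward value is stored per time step; the function names below are illustrative, not taken from the disclosure:

```python
def latest_turning_point(cumulative_rewards):
    """Index of the time step closest to the interruption (the end of the
    list) at which the cumulative reward turned from increasing to
    decreasing, or None if no such turning point exists."""
    for t in range(len(cumulative_rewards) - 2, 0, -1):
        if cumulative_rewards[t - 1] < cumulative_rewards[t] > cumulative_rewards[t + 1]:
            return t
    return None

def select_reset_candidate(candidate_steps, cumulative_rewards):
    """Pick the stored candidate at, or the latest one before, that
    turning point; fall back to the earliest stored state otherwise."""
    peak = latest_turning_point(cumulative_rewards)
    if peak is None:
        return min(candidate_steps)
    eligible = [s for s in candidate_steps if s <= peak]
    return max(eligible) if eligible else min(candidate_steps)
```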

 図10は、本開示のいくつかの実施形態に係る学習装置の構成の、もう1つの例を示す図である。図10に示す構成で、学習装置610は、方策生成部611と、行動決定部612とを備える。
 かかる構成で、方策生成部611は、学習期間を表すエピソード内で時間ステップを遡った状態における、制御対象の行動に対する評価を示す値である報酬値に基づいて、行動の決定規則である方策を生成する。行動決定部612は、制御対象の行動を方策に基づいて決定する。
 方策生成部611は、方策生成手段の例に該当する。行動決定部612は、行動決定手段の例に該当する。
FIG. 10 is a diagram showing another example of the configuration of a learning device according to some embodiments of the present disclosure. In the configuration shown in FIG. 10, a learning device 610 includes a policy generator 611 and an action determiner 612.
In this configuration, the policy generating unit 611 generates a policy, which is a decision rule for behavior, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object in a state going back a time step in an episode representing a learning period. The behavior deciding unit 612 decides the behavior of the controlled object based on the policy.
The policy generating unit 611 corresponds to an example of a policy generating means, and the action deciding unit 612 corresponds to an example of an action deciding means.

 学習装置610によれば、エピソード内で時間ステップを遡ることで、より効率的に強化学習を行えることが期待される。
 例えば、学習装置610によれば、報酬値が示す評価が悪くなる要因が発生した時間ステップよりも過去に遡って、報酬値が示す評価が悪くなる要因の発生を回避できる可能性がある。これにより、学習装置610では、エピソード全体での評価が良くなるような方策を、より早く生成できることが期待される。
It is expected that the learning device 610 can perform reinforcement learning more efficiently by going back in time steps within an episode.
For example, the learning device 610 may be able to avoid the occurrence of a factor that causes the evaluation indicated by the reward value to deteriorate, by going back to a time step earlier than the one in which that factor occurred. This is expected to enable the learning device 610 to generate, more quickly, a policy that improves the evaluation over the entire episode.

 方策生成部611は、例えば、図1の方策生成部193等の機能を用いて実現することができる。行動決定部612は、例えば、図1の行動決定部191等の機能を用いて実現することができる。 The policy generation unit 611 can be realized, for example, by using the functions of the policy generation unit 193 in FIG. 1, etc. The action decision unit 612 can be realized, for example, by using the functions of the action decision unit 191 in FIG. 1, etc.
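The disclosure leaves the concrete learning algorithm open. Purely as an illustrative sketch, and not the claimed implementation, the roles of the policy generating means and the action deciding means can be shown with tabular Q-learning, where the learned action-value table serves as the policy and actions are chosen epsilon-greedily from it; all class names and hyperparameters here are assumptions:

```python
import random
from collections import defaultdict

class LearningDevice:
    """Illustrative sketch: the Q-table plays the role of the policy
    produced by the policy generating means, and epsilon-greedy
    selection plays the role of the action deciding means."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)   # (state, action) -> estimated value
        self.actions = actions
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor
        self.epsilon = epsilon        # exploration rate

    def decide_action(self, state):
        # Action deciding means: follow the current policy, with exploration.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update_policy(self, state, action, reward, next_state):
        # Policy generating means: refine the decision rule from the reward value.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td_error
```

Any value-based or policy-gradient method could fill these two roles equally well; the split into "generate the decision rule" and "decide the action from it" is what matters here.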

 図11は、本開示のいくつかの実施形態に係る制御システムの構成の、もう1つの例を示す図である。
 図11に示す構成で、制御システム620は、方策生成部622と、行動決定部623とを備える。
FIG. 11 is a diagram illustrating another example of a control system configuration according to some embodiments of the present disclosure.
In the configuration shown in FIG. 11, the control system 620 includes a policy generator 622 and an action determiner 623.

 かかる構成で、方策生成部622は、学習期間を表すエピソード内で時間ステップを遡った状態における、制御対象の行動に対する評価を示す値である報酬値に基づいて、行動の決定規則である方策を生成する。行動決定部623は、制御対象の行動を方策に基づいて決定する。
 方策生成部622は、方策生成手段の例に該当する。行動決定部623は、行動決定手段の例に該当する。
In this configuration, the policy generating unit 622 generates a policy, which is a decision rule for behavior, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object in a state going back a time step in an episode representing a learning period. The behavior deciding unit 623 decides the behavior of the controlled object based on the policy.
The policy generating unit 622 corresponds to an example of a policy generating means, and the action deciding unit 623 corresponds to an example of an action deciding means.

 制御システム620によれば、エピソード内で時間ステップを遡ることで、より効率的に強化学習を行えることが期待される。
 例えば、制御システム620によれば、報酬値が示す評価が悪くなる要因が発生した時間ステップよりも過去に遡って、報酬値が示す評価が悪くなる要因の発生を回避できる可能性がある。これにより、制御システム620では、エピソード全体での評価が良くなるような方策を、より早く生成できることが期待される。
It is expected that the control system 620 can perform reinforcement learning more efficiently by going back in time steps within an episode.
For example, the control system 620 may be able to avoid the occurrence of a factor that causes the evaluation indicated by the reward value to deteriorate, by going back to a time step earlier than the one in which that factor occurred. This is expected to enable the control system 620 to generate, more quickly, a policy that improves the evaluation over the entire episode.

 方策生成部622は、例えば、図1の方策生成部193等の機能を用いて実現することができる。行動決定部623は、例えば、図1の行動決定部191等の機能を用いて実現することができる。 The policy generation unit 622 can be realized, for example, by using the functions of the policy generation unit 193 in FIG. 1, etc. The action decision unit 623 can be realized, for example, by using the functions of the action decision unit 191 in FIG. 1, etc.

 図12は、本開示のいくつかの実施形態に係る入出力装置の構成の例を示す図である。図12に示す構成で、入出力装置630は、状態提示部631と、状態指定受付部632とを備える。
 かかる構成で、状態提示部631は、制御対象が行動する環境の状態をユーザに提示する。状態指定受付部632は、環境を模擬するシミュレータの中断が必要な状態を指定するユーザ操作を受け付ける。
 状態提示部631は、状態提示手段の例に該当する。状態指定受付部632は、状態指定受付手段の例に該当する。
FIG. 12 is a diagram illustrating an example of a configuration of an input/output device according to some embodiments of the present disclosure. In the configuration illustrated in FIG. 12, an input/output device 630 includes a state presenting unit 631 and a state designation receiving unit 632.
With this configuration, the state presenting unit 631 presents to the user the state of the environment in which the controlled object acts. The state designation receiving unit 632 receives a user operation that designates a state in which the simulator that simulates the environment needs to be interrupted.
The state presenting unit 631 corresponds to an example of a state presenting means, and the state designation receiving unit 632 corresponds to an example of a state designation receiving means.

 入出力装置630によれば、ユーザの指定を受けて、シミュレーションの中断が必要な状態のサンプルを取得することができる。入出力装置630によれば、この点で、入出力装置630の設計者の負担が小さい。また、ユーザは、シミュレーションの中断が必要か否かについてのユーザ自らの判断を、シミュレーションの実行に反映させることができる。
 状態提示部631は、例えば、図1のリセット決定部194および表示部120等の機能を用いて実現することができる。状態指定受付部632は、例えば、図1のリセット決定部194および操作入力部130等の機能を用いて実現することができる。
According to the input/output device 630, a sample of a state in which the simulation needs to be interrupted can be acquired in response to a user's designation. In this respect, the burden on the designer of the input/output device 630 is small. Furthermore, the user can reflect his or her own judgment as to whether the simulation needs to be interrupted in the execution of the simulation.
The state presenting unit 631 can be realized, for example, by using the functions of the reset determining unit 194 and the display unit 120 in Fig. 1. The state designation receiving unit 632 can be realized, for example, by using the functions of the reset determining unit 194 and the operation input unit 130 in Fig. 1.
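How the user-designated states are compared with the current state is not prescribed here. One conceivable check, in the spirit of appendix 12 and assuming states are numeric feature vectors with cosine similarity as an (illustrative) similarity measure, is:

```python
import math

def should_interrupt(current_state, designated_states, threshold=0.9):
    """True when the current environment state is at least `threshold`-
    similar to any state the user designated as requiring interruption.
    The threshold value and similarity measure are assumptions."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
    return any(cosine(current_state, s) >= threshold for s in designated_states)
```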

 図13は、本開示のいくつかの実施形態に係る学習方法における処理の手順の例を示す図である。図13に示す学習方法は、方策を生成すること(ステップS611)と、行動を決定すること(ステップS612)とを含む。 FIG. 13 is a diagram showing an example of a processing procedure in a learning method according to some embodiments of the present disclosure. The learning method shown in FIG. 13 includes generating a strategy (step S611) and determining an action (step S612).

 方策を生成すること(ステップS611)では、コンピュータが、学習期間を表すエピソード内で時間ステップを遡った状態における、制御対象の行動に対する評価を示す値である報酬値に基づいて、前記行動の決定規則である方策を生成する。
 行動を決定すること(ステップS612)では、コンピュータが、制御対象の行動を方策に基づいて決定する。
In generating a policy (step S611), the computer generates a policy, which is a decision rule for the behavior, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object at a time step back within the episode representing the learning period.
In determining an action (step S612), the computer determines an action of the controlled object based on a strategy.

 図13に示す学習方法によれば、エピソード内で時間ステップを遡ることで、より効率的に強化学習を行えることが期待される。
 例えば、図13に示す学習方法によれば、報酬値が示す評価が悪くなる要因が発生した時間ステップよりも過去に遡って、報酬値が示す評価が悪くなる要因の発生を回避できる可能性がある。これにより、図13に示す学習方法では、エピソード全体での評価が良くなるような方策を、より早く生成できることが期待される。
According to the learning method shown in FIG. 13, it is expected that reinforcement learning can be performed more efficiently by going back a time step within an episode.
For example, according to the learning method shown in Fig. 13, it may be possible to avoid the occurrence of a factor that causes the evaluation indicated by the reward value to deteriorate, by going back to a time step earlier than the one in which that factor occurred. As a result, the learning method shown in Fig. 13 is expected to generate, more quickly, a policy that improves the evaluation over the entire episode.

 図14は、本開示のいくつかの実施形態に係るコンピュータの構成の例を示す図である。
 図14に示す構成で、コンピュータ700は、CPU710と、主記憶装置720と、補助記憶装置730と、インタフェース740と、不揮発性記録媒体750とを備える。
FIG. 14 is a diagram illustrating an example of a computer configuration according to some embodiments of the present disclosure.
In the configuration shown in FIG. 14, a computer 700 includes a CPU 710 , a main memory device 720 , an auxiliary memory device 730 , an interface 740 , and a non-volatile recording medium 750 .

 上記の学習装置100、制御装置200、学習装置610、学習装置621、制御装置626、および、入出力装置630のうち何れか1つ以上またはその一部が、コンピュータ700に実装されてもよい。その場合、上述した各処理部の動作は、プログラムの形式で補助記憶装置730に記憶されている。CPU710は、プログラムを補助記憶装置730から読み出して主記憶装置720に展開し、当該プログラムに従って上記処理を実行する。また、CPU710は、プログラムに従って、上述した各記憶部に対応する記憶領域を主記憶装置720に確保する。各装置と他の装置との通信は、インタフェース740が通信機能を有し、CPU710の制御に従って通信を行うことで実行される。また、インタフェース740は、不揮発性記録媒体750用のポートを有し、不揮発性記録媒体750からの情報の読出、および、不揮発性記録媒体750への情報の書込を行う。 Any one or more of the learning device 100, the control device 200, the learning device 610, the learning device 621, the control device 626, and the input/output device 630, or a part of them, may be implemented in the computer 700. In this case, the operation of each of the above-mentioned processing units is stored in the auxiliary storage device 730 in the form of a program. The CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program. The CPU 710 also secures a storage area corresponding to each of the above-mentioned storage units in the main storage device 720 according to the program. Communication between each device and other devices is executed by the interface 740 having a communication function and communicating according to the control of the CPU 710. The interface 740 also has a port for the non-volatile recording medium 750, and reads information from the non-volatile recording medium 750 and writes information to the non-volatile recording medium 750.

 学習装置100がコンピュータ700に実装される場合、処理部190およびその各部の動作は、プログラムの形式で補助記憶装置730に記憶されている。CPU710は、プログラムを補助記憶装置730から読み出して主記憶装置720に展開し、当該プログラムに従って上記処理を実行する。 When the learning device 100 is implemented in a computer 700, the operations of the processing unit 190 and each of its units are stored in the auxiliary storage device 730 in the form of a program. The CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.

 また、CPU710は、プログラムに従って、記憶部180のための記憶領域を主記憶装置720に確保する。通信部110による他の装置との通信は、インタフェース740が通信機能を有し、CPU710の制御に従って動作することで実行される。表示部120による画像の表示は、インタフェース740が表示装置を備え、CPU710の制御に従って各種画像を表示することで実行される。操作入力部130によるユーザ操作の受け付けは、インタフェース740が入力デバイスを備え、CPU710の制御に従ってユーザ操作を受け付けることで実行される。 The CPU 710 also reserves a memory area for the memory unit 180 in the main memory device 720 in accordance with the program. Communication with other devices by the communication unit 110 is achieved by the interface 740 having a communication function and operating under the control of the CPU 710. Display of images by the display unit 120 is achieved by the interface 740 having a display device and displaying various images under the control of the CPU 710. Reception of user operations by the operation input unit 130 is achieved by the interface 740 having an input device and accepting user operations under the control of the CPU 710.

 制御装置200がコンピュータ700に実装される場合、その動作は、プログラムの形式で補助記憶装置730に記憶されている。CPU710は、プログラムを補助記憶装置730から読み出して主記憶装置720に展開し、当該プログラムに従って上記処理を実行する。 When the control device 200 is implemented in the computer 700, its operation is stored in the form of a program in the auxiliary storage device 730. The CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.

 また、CPU710は、プログラムに従って、制御装置200が処理を行うための記憶領域を主記憶装置720に確保する。制御装置200と他の装置との通信は、インタフェース740が通信機能を有し、CPU710の制御に従って動作することで実行される。制御装置200とユーザとのインタラクションは、インタフェース740が入力デバイスおよび出力デバイスを有し、CPU710の制御に従って出力デバイスにて情報をユーザに提示し、入力デバイスにてユーザ操作を受け付けることで実行される。 The CPU 710 also allocates a memory area in the main memory device 720 for the control device 200 to perform processing according to the program. Communication between the control device 200 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710. Interaction between the control device 200 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.

 学習装置610がコンピュータ700に実装される場合、方策生成部611と、行動決定部612との動作は、プログラムの形式で補助記憶装置730に記憶されている。CPU710は、プログラムを補助記憶装置730から読み出して主記憶装置720に展開し、当該プログラムに従って上記処理を実行する。 When the learning device 610 is implemented in the computer 700, the operations of the policy generation unit 611 and the action decision unit 612 are stored in the auxiliary storage device 730 in the form of a program. The CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.

 また、CPU710は、プログラムに従って、学習装置610が処理を行うための記憶領域を主記憶装置720に確保する。学習装置610と他の装置との通信は、インタフェース740が通信機能を有し、CPU710の制御に従って動作することで実行される。学習装置610とユーザとのインタラクションは、インタフェース740が入力デバイスおよび出力デバイスを有し、CPU710の制御に従って出力デバイスにて情報をユーザに提示し、入力デバイスにてユーザ操作を受け付けることで実行される。 The CPU 710 also allocates a memory area in the main memory device 720 for the learning device 610 to perform processing according to the program. Communication between the learning device 610 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710. Interaction between the learning device 610 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.

 学習装置621がコンピュータ700に実装される場合、方策生成部622と、行動決定部623との動作は、プログラムの形式で補助記憶装置730に記憶されている。CPU710は、プログラムを補助記憶装置730から読み出して主記憶装置720に展開し、当該プログラムに従って上記処理を実行する。 When the learning device 621 is implemented in the computer 700, the operations of the policy generation unit 622 and the action decision unit 623 are stored in the auxiliary storage device 730 in the form of a program. The CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.

 また、CPU710は、プログラムに従って、学習装置621が処理を行うための記憶領域を主記憶装置720に確保する。学習装置621と他の装置との通信は、インタフェース740が通信機能を有し、CPU710の制御に従って動作することで実行される。学習装置621とユーザとのインタラクションは、インタフェース740が入力デバイスおよび出力デバイスを有し、CPU710の制御に従って出力デバイスにて情報をユーザに提示し、入力デバイスにてユーザ操作を受け付けることで実行される。 The CPU 710 also allocates a memory area in the main memory device 720 for the learning device 621 to perform processing according to the program. Communication between the learning device 621 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710. Interaction between the learning device 621 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.

 制御装置626がコンピュータ700に実装される場合、その動作は、プログラムの形式で補助記憶装置730に記憶されている。CPU710は、プログラムを補助記憶装置730から読み出して主記憶装置720に展開し、当該プログラムに従って上記処理を実行する。 When the control device 626 is implemented in the computer 700, its operation is stored in the auxiliary storage device 730 in the form of a program. The CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.

 また、CPU710は、プログラムに従って、制御装置626が処理を行うための記憶領域を主記憶装置720に確保する。制御装置626と他の装置との通信は、インタフェース740が通信機能を有し、CPU710の制御に従って動作することで実行される。制御装置626とユーザとのインタラクションは、インタフェース740が入力デバイスおよび出力デバイスを有し、CPU710の制御に従って出力デバイスにて情報をユーザに提示し、入力デバイスにてユーザ操作を受け付けることで実行される。 The CPU 710 also allocates a memory area in the main memory 720 for the control device 626 to perform processing according to the program. Communication between the control device 626 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710. Interaction between the control device 626 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.

 入出力装置630がコンピュータ700に実装される場合、状態提示部631と、状態指定受付部632との動作は、プログラムの形式で補助記憶装置730に記憶されている。CPU710は、プログラムを補助記憶装置730から読み出して主記憶装置720に展開し、当該プログラムに従って上記処理を実行する。 When the input/output device 630 is implemented in the computer 700, the operations of the state presentation unit 631 and the state designation reception unit 632 are stored in the auxiliary storage device 730 in the form of a program. The CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.

 また、CPU710は、プログラムに従って、入出力装置630が処理を行うための記憶領域を主記憶装置720に確保する。入出力装置630と他の装置との通信は、インタフェース740が通信機能を有し、CPU710の制御に従って動作することで実行される。入出力装置630とユーザとのインタラクションは、インタフェース740が入力デバイスおよび出力デバイスを有し、CPU710の制御に従って出力デバイスにて情報をユーザに提示し、入力デバイスにてユーザ操作を受け付けることで実行される。 The CPU 710 also allocates a storage area in the main memory device 720 for the I/O device 630 to perform processing according to the program. Communication between the I/O device 630 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710. Interaction between the I/O device 630 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.

 上述したプログラムのうち何れか1つ以上が不揮発性記録媒体750に記録されていてもよい。この場合、インタフェース740が不揮発性記録媒体750からプログラムを読み出すようにしてもよい。そして、CPU710が、インタフェース740が読み出したプログラムを直接実行するか、あるいは、主記憶装置720または補助記憶装置730に一旦保存して実行するようにしてもよい。 Any one or more of the above-mentioned programs may be recorded on the non-volatile recording medium 750. In this case, the interface 740 may read the program from the non-volatile recording medium 750. The CPU 710 may then directly execute the program read by the interface 740, or may temporarily store the program in the main memory device 720 or the auxiliary memory device 730 and then execute it.

 なお、学習装置100、制御装置200、学習装置610、学習装置621、制御装置626、および、入出力装置630が行う処理の全部または一部を実行するためのプログラムをコンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行することにより各部の処理を行ってもよい。なお、ここでいう「コンピュータシステム」とは、OS(Operating System)や周辺機器等のハードウェアを含むものとする。
 また、「コンピュータ読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ROM(Read Only Memory)、CD-ROM(Compact Disc Read Only Memory)等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置のことをいう。また上記プログラムは、前述した機能の一部を実現するためのものであってもよく、さらに前述した機能をコンピュータシステムにすでに記録されているプログラムとの組み合わせで実現できるものであってもよい。
In addition, a program for executing all or part of the processing performed by learning device 100, control device 200, learning device 610, learning device 621, control device 626, and input/output device 630 may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read into a computer system and executed to perform processing of each part. Note that the term "computer system" here includes hardware such as an OS (Operating System) and peripheral devices.
Furthermore, the term "computer-readable recording medium" refers to portable media such as flexible disks, optical magnetic disks, ROMs (Read Only Memory), and CD-ROMs (Compact Disc Read Only Memory), as well as storage devices such as hard disks built into computer systems. The above-mentioned program may be for realizing part of the above-mentioned functions, or may be capable of realizing the above-mentioned functions in combination with a program already recorded in the computer system.

 以上、この開示の実施形態について図面を参照して詳述してきたが、具体的な構成はこの実施形態に限られるものではなく、この開示の要旨を逸脱しない範囲の設計等も含まれる。  Although the embodiments of this disclosure have been described in detail above with reference to the drawings, the specific configuration is not limited to this embodiment, and includes designs that do not deviate from the gist of this disclosure.

 上記の実施形態の一部又は全部は、以下の付記のようにも記載されうるが、以下には限られない。 Some or all of the above embodiments can be described as follows, but are not limited to the following:

(付記1)
 学習期間を表すエピソード内で時間ステップを遡った状態における、制御対象の行動に対する評価を示す値である報酬値に基づいて、前記行動の決定規則である方策を生成する方策生成手段と、
 前記制御対象の行動を前記方策に基づいて決定する行動決定手段と、
 を備える学習装置。
(Appendix 1)
a policy generating means for generating a policy, which is a decision rule for the action, based on a reward value, which is a value indicating an evaluation of the action of the controlled object in a state going back a time step in an episode representing a learning period;
an action decision means for deciding an action of the control target based on the measure;
A learning device comprising:

(付記2)
 前記制御対象の行動のシミュレーションを中断する条件として予め定められている条件が成立していると判定した場合、前記シミュレーションを中断するリセット決定手段
 をさらに備え、
 前記方策生成手段は、前記シミュレーションに基づいて計算される前記報酬値に基づいて、前記方策を生成する、
 付記1に記載の学習装置。
(Appendix 2)
a reset decision means for suspending the simulation when it is determined that a predetermined condition for suspending the simulation of the behavior of the controlled object is satisfied,
The policy generating means generates the policy based on the reward value calculated based on the simulation.
The learning device according to appendix 1.

(付記3)
 前記シミュレーションを行って、前記制御対象が行動する環境の状態を計算するシミュレーション手段
 をさらに備える、付記2に記載の学習装置。
(Appendix 3)
The learning device according to appendix 2, further comprising a simulation means for performing the simulation to calculate a state of an environment in which the controlled object acts.

(付記4)
 前記リセット決定手段は、瞬時報酬値の累積値である累積報酬値の変化に基づいて、前記シミュレーションを中断するか否かを決定する、
 付記2または付記3に記載の学習装置。
(Appendix 4)
The reset decision means decides whether or not to interrupt the simulation based on a change in a cumulative reward value, which is a cumulative value of an instantaneous reward value.
The learning device according to appendix 2 or 3.

(付記5)
 前記リセット決定手段は、前記累積報酬値の増減に基づいて、前記累積報酬値が示す評価が所定の閾値よりも悪化したと判定した場合、前記シミュレーションを中断することに決定する、
 付記4に記載の学習装置。
(Appendix 5)
The reset decision means decides to interrupt the simulation when it is determined that the evaluation indicated by the cumulative reward value has deteriorated below a predetermined threshold based on the increase or decrease of the cumulative reward value.
The learning device according to appendix 4.

(付記6)
 前記リセット決定手段は、前記累積報酬値の閾値が示す評価の悪化の幅が、前記シミュレーションの進行に応じて大きくなるように前記閾値を更新する、
 付記5に記載の学習装置。
(Appendix 6)
The reset determination means updates the threshold value so that a range of deterioration in the evaluation indicated by the threshold value of the cumulative reward value increases as the simulation progresses.
The learning device according to appendix 5.

(付記7)
 前記リセット決定手段は、前記累積報酬値の増減に基づいて、前記累積報酬値が示す評価の悪化が所定の時間ステップ数以上継続した場合、前記シミュレーションを中断することに決定する、
 付記4から6の何れか一つに記載の学習装置。
(Appendix 7)
The reset decision means decides to interrupt the simulation when a deterioration in the evaluation indicated by the cumulative reward value continues for a predetermined number of time steps or more based on an increase or decrease in the cumulative reward value.
The learning device according to any one of appendices 4 to 6.

(付記8)
 前記リセット決定手段は、瞬時報酬値の累積値である累積報酬値の予測値である価値関数値に基づいて、前記シミュレーションを中断するか否かを決定する、
 付記2から7の何れか一つに記載の学習装置。
(Appendix 8)
The reset determination means determines whether or not to interrupt the simulation based on a value function value that is a predicted value of a cumulative reward value that is a cumulative value of an instantaneous reward value.
The learning device according to any one of appendices 2 to 7.

(付記9)
 前記リセット決定手段は、前記価値関数値の増減に基づいて、前記価値関数値が示す評価が所定の閾値よりも悪化したと判定した場合、前記シミュレーションを中断することに決定する、
 付記8に記載の学習装置。
(Appendix 9)
The reset decision means decides to interrupt the simulation when it is determined that the evaluation indicated by the value function value has deteriorated below a predetermined threshold based on an increase or decrease in the value function value.
The learning device according to appendix 8.

(付記10)
 前記リセット決定手段は、前記価値関数値の閾値が示す評価の悪化の幅が、前記シミュレーションの進行に応じて大きくなるように前記閾値を更新する、
 付記9に記載の学習装置。
(Appendix 10)
the reset determination means updates the threshold value so that a range of deterioration in the evaluation indicated by the threshold value of the value function value increases as the simulation progresses.
The learning device according to appendix 9.

(付記11)
 前記リセット決定手段は、前記価値関数値の増減に基づいて、前記価値関数値が示す評価の悪化が所定の時間ステップ数以上継続した場合、前記シミュレーションを中断することに決定する、
 付記9または付記10に記載の学習装置。
(Appendix 11)
The reset decision means decides to interrupt the simulation when a deterioration in the evaluation indicated by the value of the value function continues for a predetermined number of time steps or more based on an increase or decrease in the value of the value function.
The learning device according to appendix 9 or 10.

(付記12)
 前記リセット決定手段は、前記シミュレーションで計算される、前記制御対象が行動する環境の現在の状態が、中断が必要な状態として予め設定されている1つ以上の状態の何れかと一定以上類似していると判定した場合、前記シミュレーションを中断することに決定する、
 付記2から11の何れか一つに記載の学習装置。
(Appendix 12)
the reset decision means decides to interrupt the simulation when it is determined that a current state of the environment in which the control target acts, calculated in the simulation, is similar, to a certain degree or more, to any of one or more states preset as states requiring interruption;
The learning device according to any one of appendices 2 to 11.

(付記13)
 前記環境の状態をユーザに提示する状態提示手段と、
 前記中断が必要な状態を指定するユーザ操作を受け付ける状態指定受け付け手段と、
 をさらに備える、付記12に記載の学習装置。
(Appendix 13)
a state presenting means for presenting a state of the environment to a user;
a state designation receiving means for receiving a user operation for designating a state in which the interruption is required;
The learning device according to appendix 12, further comprising the means described above.

(付記14)
 中断されたシミュレーションを再開する際の、前記制御対象が行動する環境の状態を決定する再開状態決定手段
 をさらに備える、付記2から13の何れか一つに記載の学習装置。
(Appendix 14)
The learning device according to any one of appendices 2 to 13, further comprising a restart state determination means for determining a state of an environment in which the controlled object acts when resuming an interrupted simulation.

(付記15)
 前記再開状態決定手段は、シミュレーションの中断時まで実行されていたエピソード内でシミュレーションを再開可能な時間ステップとして状態が記憶されている実行済みの時間ステップであるリセット先候補に設定されている確率分布に基づいて、何れか1つのリセット先候補を選択する、
 付記14に記載の学習装置。
(Appendix 15)
the restart state determination means selects one of the reset destination candidates based on a probability distribution set for the reset destination candidates, which are executed time steps whose states are stored as time steps at which the simulation can be restarted within the episode that was executed until the simulation was interrupted;
The learning device according to appendix 14.

(付記16)
 前記再開状態決定手段は、複数のリセット先候補のうち何れか1つを一様分布の確率分布に従って選択する、
 付記15に記載の学習装置。
(Appendix 16)
The restart state determination means selects one of a plurality of reset destination candidates in accordance with a uniform probability distribution.
The learning device according to appendix 15.

(付記17)
 前記再開状態決定手段は、複数のリセット先候補に対して、瞬時報酬値の累積値である累積報酬値の増減で示される、シミュレーションの中断時における評価の悪化が大きいほど、シミュレーションの中断時とリセット先候補との間の時間ステップ数が多いリセット先候補を選び易くなるように確率分布を設定し、設定した確率分布に従って何れか1つのリセット先候補を選択する、
 付記15に記載の学習装置。
(Appendix 17)
The restart state determination means sets a probability distribution for the plurality of reset destination candidates such that the greater the deterioration of the evaluation at the time of interruption of the simulation, which is indicated by an increase or decrease in an accumulated reward value that is an accumulated value of the instantaneous reward value, the easier it is to select a reset destination candidate having a larger number of time steps between the time of interruption of the simulation and the reset destination candidate, and selects one of the reset destination candidates according to the set probability distribution.
The learning device according to appendix 15.

(付記18)
 前記再開状態決定手段は、シミュレーションの中断時まで実行されていたエピソード内でシミュレーションを再開可能な時間ステップとして状態が記憶されている実行済みの時間ステップであるリセット先候補のうち、瞬時報酬値の累積値である累積報酬値の増減で示される評価の変化が良化から悪化に転じている時間ステップのうちシミュレーションの中断時との間の時間ステップ数が最も少ない時間ステップまたはそれよりも前の時間ステップに相当するリセット先候補を選択する、
 付記14から17の何れか一つに記載の学習装置。
(Appendix 18)
The restart state determination means selects a reset destination candidate corresponding to a time step having the smallest number of time steps between the time of interruption of the simulation and the time step in which the change in the evaluation indicated by the increase or decrease in the cumulative reward value, which is the cumulative value of the instantaneous reward value, has turned from improvement to deterioration, among the reset destination candidates which are executed time steps whose states are stored as time steps at which the simulation can be restarted within the episode that was executed until the interruption of the simulation, or a time step earlier than that.
The learning device according to any one of appendices 14 to 17.

(付記19)
 付記1から18の何れか一つに記載の学習装置を用いて得られた方策に基づいて、制御対象に対する制御を行う、制御装置。
(Appendix 19)
A control device that performs control of a control target based on a policy obtained using the learning device according to any one of appendices 1 to 18.

(付記20)
 学習装置と制御装置とを備え、
 前記学習装置は、
 学習期間を表すエピソード内で時間ステップを遡った状態における、制御対象の行動に対する評価を示す値である報酬値に基づいて、前記行動の決定規則である方策を生成する方策生成手段と、
 前記制御対象の行動を前記方策に基づいて決定する行動決定手段と、
 を備え、
 前記制御装置は、前記学習装置を用いて得られた方策に基づいて、制御対象に対する制御を行う、
 制御システム。
(Appendix 20)
A learning device and a control device are provided,
The learning device includes:
a policy generating means for generating a policy, which is a decision rule for the action, based on a reward value, which is a value indicating an evaluation of the action of the controlled object in a state going back a time step in an episode representing a learning period;
an action decision means for deciding an action of the control target based on the measure;
Equipped with
The control device controls a control target based on the policy obtained using the learning device.
Control system.

(Appendix 21)
An input/output device comprising:
a state presentation means for presenting to a user the state of an environment in which a control target acts; and
a state designation reception means for receiving a user operation designating a state in which a simulator that simulates the environment needs to be interrupted.

(Appendix 22)
A learning method comprising, by a computer:
generating a policy, which is a decision rule for an action, based on a reward value indicating an evaluation of the action of a control target in a state reached by going back time steps within an episode representing a learning period; and
determining the action of the control target based on the policy.

(Appendix 23)
A recording medium storing a program for causing a computer to execute:
generating a policy, which is a decision rule for an action, based on a reward value indicating an evaluation of the action of a control target in a state reached by going back time steps within an episode representing a learning period; and
determining the action of the control target based on the policy.

This application claims priority based on Japanese Patent Application No. 2023-103973, filed on June 26, 2023, the entire disclosure of which is incorporated herein by reference.

The present disclosure may be applied to a learning device, a control system, an input/output device, a learning method, and a recording medium.

1, 620 Control system
100, 610, 621 Learning device
110 Communication unit
120 Display unit
130 Operation input unit
180 Storage unit
190 Processing unit
191, 612, 623 Action determination unit
192 Simulation unit
193, 611, 622 Policy generation unit
194 Reset determination unit
195 Restart state determination unit
200, 624 Control device
630 Input/output device
631 State presentation unit
632 State designation reception unit
910 Control target

Claims (20)

1. A learning device comprising:
a policy generation means for generating a policy, which is a decision rule for an action, based on a reward value indicating an evaluation of the action of a control target in a state reached by going back time steps within an episode representing a learning period; and
an action determination means for determining the action of the control target based on the policy.
2. The learning device according to claim 1, further comprising:
a reset determination means for interrupting a simulation of the action of the control target when it determines that a predetermined condition for interrupting the simulation is satisfied,
wherein the policy generation means generates the policy based on the reward value calculated based on the simulation.
3. The learning device according to claim 2, further comprising a simulation means for performing the simulation to calculate the state of an environment in which the control target acts.
4. The learning device according to claim 2 or 3, wherein the reset determination means determines whether to interrupt the simulation based on a change in a cumulative reward value, which is a cumulative value of instantaneous reward values.
5. The learning device according to claim 4, wherein the reset determination means determines to interrupt the simulation when it determines, based on the increase or decrease of the cumulative reward value, that the evaluation indicated by the cumulative reward value has deteriorated beyond a predetermined threshold.
6. The learning device according to claim 5, wherein the reset determination means updates the threshold such that the width of deterioration in evaluation indicated by the threshold of the cumulative reward value increases as the simulation progresses.
7. The learning device according to any one of claims 4 to 6, wherein the reset determination means determines to interrupt the simulation when, based on the increase or decrease of the cumulative reward value, the deterioration in the evaluation indicated by the cumulative reward value has continued for a predetermined number of time steps or more.
8. The learning device according to any one of claims 2 to 7, wherein the reset determination means determines whether to interrupt the simulation based on a value function value, which is a predicted value of a cumulative reward value, the cumulative reward value being a cumulative value of instantaneous reward values.
9. The learning device according to claim 8, wherein the reset determination means determines to interrupt the simulation when it determines, based on the increase or decrease of the value function value, that the evaluation indicated by the value function value has deteriorated beyond a predetermined threshold.
10. The learning device according to claim 9, wherein the reset determination means updates the threshold such that the width of deterioration in evaluation indicated by the threshold of the value function value increases as the simulation progresses.
11. The learning device according to claim 9 or 10, wherein the reset determination means determines to interrupt the simulation when, based on the increase or decrease of the value function value, the deterioration in the evaluation indicated by the value function value has continued for a predetermined number of time steps or more.
12. The learning device according to any one of claims 2 to 11, wherein the reset determination means determines to interrupt the simulation when it determines that the current state of the environment in which the control target acts, as calculated in the simulation, is similar, to at least a certain degree, to any of one or more states preset as states requiring interruption.
13. The learning device according to claim 12, further comprising:
a state presentation means for presenting the state of the environment to a user; and
a state designation reception means for receiving a user operation designating a state requiring interruption.
14. The learning device according to any one of claims 2 to 13, further comprising a restart state determination means for determining the state of the environment in which the control target acts when restarting an interrupted simulation.
15. The learning device according to claim 14, wherein the restart state determination means selects one of the reset destination candidates based on a probability distribution set over the reset destination candidates, the reset destination candidates being executed time steps whose states are stored as time steps at which the simulation can be restarted within the episode executed up to the interruption of the simulation.
16. The learning device according to claim 15, wherein the restart state determination means selects one of the plurality of reset destination candidates according to a uniform probability distribution.
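Claims 15 and 16 describe choosing a restart point among the stored candidates according to a probability distribution, with a uniform distribution as one concrete case. A minimal sketch under those assumptions (the function name and the use of `random.choices` for the general weighted case are illustrative choices, not part of the claims):

```python
import random

def choose_restart_step(candidate_steps, weights=None, rng=None):
    """Choose a restart time step among stored reset destination candidates.

    weights: optional probability weights per candidate; None means a
    uniform distribution over the candidates (as in claim 16).
    """
    rng = rng or random.Random()
    if weights is None:
        return rng.choice(candidate_steps)  # uniform distribution
    return rng.choices(candidate_steps, weights=weights, k=1)[0]

rng = random.Random(0)
steps = [0, 5, 10, 15]
picked = [choose_restart_step(steps, rng=rng) for _ in range(1000)]
counts = {s: picked.count(s) for s in steps}
print(counts)  # each candidate picked roughly 250 times under a uniform draw
```

A non-uniform `weights` vector could, for instance, bias restarts toward candidates just before the point where the evaluation began to deteriorate, in the spirit of appendix 18.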
17. A control system comprising a learning device and a control device,
wherein the learning device comprises:
a policy generation means for generating a policy, which is a decision rule for an action, based on a reward value indicating an evaluation of the action of a control target in a state reached by going back time steps within an episode representing a learning period; and
an action determination means for determining the action of the control target based on the policy,
and wherein the control device controls the control target based on the policy obtained using the learning device.
18. An input/output device comprising:
a state presentation means for presenting to a user the state of an environment in which a control target acts; and
a state designation reception means for receiving a user operation designating a state in which a simulator that simulates the environment needs to be interrupted.
19. A learning method comprising, by a computer:
generating a policy, which is a decision rule for an action, based on a reward value indicating an evaluation of the action of a control target in a state reached by going back time steps within an episode representing a learning period; and
determining the action of the control target based on the policy.
20. A recording medium storing a program for causing a computer to execute:
generating a policy, which is a decision rule for an action, based on a reward value indicating an evaluation of the action of a control target in a state reached by going back time steps within an episode representing a learning period; and
determining the action of the control target based on the policy.
PCT/JP2024/021691 2023-06-26 2024-06-14 Learning device, control system, input/output device, learning method, and recording medium Pending WO2025004859A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2023103973 2023-06-26
JP2023-103973 2023-06-26

Publications (1)

Publication Number Publication Date
WO2025004859A1 (en) 2025-01-02

Family

ID=93938914

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2024/021691 Pending WO2025004859A1 (en) 2023-06-26 2024-06-14 Learning device, control system, input/output device, learning method, and recording medium

Country Status (1)

Country Link
WO (1) WO2025004859A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180089553A1 (en) * 2016-09-27 2018-03-29 Disney Enterprises, Inc. Learning to schedule control fragments for physics-based character simulation and robots using deep q-learning
JP2019529135A (en) * 2016-09-15 2019-10-17 グーグル エルエルシー Deep reinforcement learning for robot operation
US10800040B1 (en) * 2017-12-14 2020-10-13 Amazon Technologies, Inc. Simulation-real world feedback loop for learning robotic control policies


Similar Documents

Publication Publication Date Title
US20210374538A1 (en) Reinforcement learning using target neural networks
US11341420B2 (en) Hyperparameter optimization method and apparatus
CN101842754B (en) For the method for the state of discovery techniques system in a computer-assisted way
US11480623B2 (en) Apparatus and method for predicting a remaining battery life in a device
US20210406683A1 (en) Learning method and information processing apparatus
CN110781969A (en) Air conditioner air volume control method and device based on deep reinforcement learning and medium
JP2019159888A (en) Machine learning system
EP3926547A1 (en) Program, learning method, and information processing apparatus
US20220017106A1 (en) Moving object control device, moving object control learning device, and moving object control method
WO2025004859A1 (en) Learning device, control system, input/output device, learning method, and recording medium
US20240394554A1 (en) Learning device, learning method, control system, and recording medium
US20220150148A1 (en) Latency mitigation system and method
CN115545188A (en) Multitask offline data sharing method and system based on uncertainty estimation
JP2022140092A (en) Device for reinforcement learning, method for reinforcement learning, and program
US20210004717A1 (en) Learning method and recording medium
JP7574940B2 (en) Operational rule determination device, operational rule determination method, and program
US20090030861A1 (en) Probabilistic Prediction Based Artificial Intelligence Planning System
JP2025002046A (en) Multi-agent reinforcement learning system and program
JP7505563B2 (en) Learning device, learning method, control system and program
JP7173317B2 (en) Operation rule determination device, operation rule determination method and program
JP2024140139A (en) Learning device, learning method, and program
JP7421391B2 (en) Learning methods and programs
EP3996005A1 (en) Calculation processing program, calculation processing method, and information processing device
Dementyeva et al. Runtime assurance for intelligent cyber-physical systems
CN114254765A (en) Active sequence decision method, device and medium for simulation deduction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24831723

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2025529646

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 2025529646

Country of ref document: JP