JP3703821B2

JP3703821B2 - Parallel learning device, parallel learning method, and parallel learning program

Info

Publication number: JP3703821B2
Application number: JP2003310383A
Authority: JP
Inventors: Eiji Uchibe (内部 英治); Kenji Doya (銅谷 賢治)
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2003-09-02
Filing date: 2003-09-02
Publication date: 2005-10-05
Anticipated expiration: 2023-09-02
Also published as: JP2005078516A

Description

The present invention relates to a parallel learning device, a parallel learning method, and a parallel learning program for learning an action policy for achieving a given task.

Minsky proposes that the human mind, like human society, is driven by various agents that cooperate and compete: intelligence can be viewed as a collection of simple agents, and overall behavior is generated as a result of the interactions among them. This idea has also attracted attention in computational neuroscience. Studies of motor-procedure learning suggest that multiple learning modules learn simultaneously and in parallel in different coordinate systems, each contributing to the learning of a sequence.

For the problem of learning complex behavior with reinforcement learning, several methods have already been proposed that prepare multiple learners and switch among them. Examples include a method that switches among multiple learners according to the TD error (see Non-Patent Document 1) and a method that runs in parallel modules each pairing a prediction model of the controlled object with a reinforcement learner, and switches and combines them based on the prediction errors of the prediction models (see Non-Patent Document 2).

Non-Patent Document 1: S. P. Singh, "Transfer of learning by composing solutions of elemental sequential tasks", Machine Learning, 1992, vol. 3, pp. 9-44.
Non-Patent Document 2: K. Doya et al., "Multiple Model-Based Reinforcement Learning", Neural Computation, 2002, vol. 14, pp. 1347-1369.

However, in the conventional methods above, every learner has the same structure and uses the same learning method. The learning efficiency of the learners as a whole is therefore no different from that of a single learner, and the multiple learners cannot be trained efficiently.

An object of the present invention is to provide a parallel learning device, a parallel learning method, and a parallel learning program that can greatly shorten the learning time needed to acquire a structure suited to a task by training a plurality of learning means efficiently.

A parallel learning device according to the present invention learns an action policy for achieving a given task and comprises: acquisition means for acquiring the state of the external world; a plurality of learning means that learn based on the state acquired by the acquisition means and determine an action policy from the learning result; and selection means for selecting one action policy, based on the learning performance of each learning means, from among the action policies determined by the plurality of learning means. The device repeats the following processing: the selection means selects one learning means from among the plurality of learning means based on the learning performance of each; the acquisition means acquires the state of the external world; each learning means learns, simultaneously with the other learning means, based on the acquired state and determines an action policy from the learning result; the selection means outputs the action policy determined by the selected learning means; and each learning means corrects the parameters it uses for learning by weighting them, using the importance sampling method, according to the similarity between the action policy it determined and the action policy of the selected learning means.

In the parallel learning device according to the present invention, the state of the external world is acquired, the plurality of learning means learn simultaneously based on the acquired state, an action policy is determined from each learning result, one action policy is selected from among the determined action policies based on the learning performance of each learning means, and an action following the selected action policy is executed.

By repeating this processing, the learning means that were not selected also learn from the experience obtained under the action policy determined by the selected learning means, so the plurality of learning means can learn action policies for achieving the task simultaneously. The learners can therefore be trained efficiently, and the learning time needed to acquire a structure suited to the task can be greatly shortened.

Preferably, when exactly one learning means has the highest learning performance, the selection means selects that learning means; when several learning means have high learning performance and their learning performances lie within a predetermined range, the selection means selects one of them with equal probability. In this case, one learning means can be selected probabilistically from among those whose learning performance lies within the predetermined range, so the plurality of learning means can be trained efficiently.

Preferably, each of the plurality of learning means differs from the other learning means in at least one of its state representation and its learning method. In this case, learning can use learning means with different learning characteristics. For example, data collected quickly by a learning means with a simple configuration can be used by a learning means with a complex configuration, which improves both learning speed and learning performance.

Preferably, each of the plurality of learning means comprises: calculation means for calculating, using predetermined parameters and based on the state of the external world acquired by the acquisition means, a value function for evaluating learning performance; determination means for determining an action policy based on the acquired state and the value function calculated by the calculation means; and correction means for correcting the parameters of the calculation means based on the acquired state, the action policy determined by the determination means, and the action policy selected by the selection means.

In this case, the parameters used to calculate the value function are corrected based on the acquired state of the external world, the action policy determined from that state and the value function, and the selected action policy. The learning means that were not selected can therefore also learn from the experience obtained under the action policy determined by the selected learning means.

Preferably, at least one of the plurality of learning means further comprises storage means for storing the action policy determined by the determination means. In this case, because the learning means has storage means, partially observable Markov decision problems can be handled.

A parallel learning method according to the present invention learns an action policy for achieving a given task using a parallel learning device comprising acquisition means, a plurality of learning means, and selection means. The method repeats the following steps: a selection step in which the selection means selects one learning means from among the plurality of learning means based on the learning performance of each; an acquisition step in which the acquisition means acquires the state of the external world; a learning step in which each learning means learns, simultaneously with the other learning means, based on the state acquired in the acquisition step and determines an action policy from the learning result; a step in which the selection means outputs the action policy determined by the learning means selected in the selection step; and a step in which each learning means corrects the parameters it uses for learning by weighting them, using the importance sampling method, according to the similarity between the action policy it determined and the action policy of the learning means selected in the selection step.

A parallel learning program according to the present invention is a program for learning an action policy for achieving a given task. It causes a computer to function as: acquisition means for acquiring the state of the external world; a plurality of learning means that learn based on the state acquired by the acquisition means and determine an action policy from the learning result; and selection means for selecting one action policy, based on the learning performance of each learning means, from among the action policies determined by the plurality of learning means. The program repeats the following processing: the selection means selects one learning means from among the plurality of learning means based on the learning performance of each; the acquisition means acquires the state of the external world; each learning means learns, simultaneously with the other learning means, based on the acquired state and determines an action policy from the learning result; the selection means outputs the action policy determined by the selected learning means; and each learning means corrects the parameters it uses for learning by weighting them, using the importance sampling method, according to the similarity between the action policy it determined and the action policy of the selected learning means.

According to the present invention, the learning means that were not selected also learn from the experience obtained under the action policy determined by the selected learning means, so the plurality of learning means can learn action policies for achieving the task simultaneously. The learners can therefore be trained efficiently, and the learning time needed to acquire a structure suited to the task can be greatly shortened.

A parallel learning device according to an embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a learning system that uses the parallel learning device according to an embodiment of the present invention.

The learning system shown in FIG. 1 comprises a sensor unit 1, a parallel learning device 2, and an actuator unit 3. The parallel learning device 2 consists of an ordinary microcomputer with a ROM (read-only memory), a CPU (central processing unit), a RAM (random-access memory), and so on, together with an A/D (analog-to-digital) converter, a D/A (digital-to-analog) converter, and the like. By executing on the CPU the parallel learning program stored in the ROM, it functions as a state acquisition unit 11, a stochastic selector 12, a switch 13, and n learners 21 to 2n, and learns an action policy for achieving a given task.

The sensor unit 1 consists of various sensors that detect the state of the external world, and the actuator unit 3 consists of various actuators that execute predetermined actions following the action policy for achieving the given task. For example, when the learning system is configured as an autonomous mobile robot, the sensor unit 1 can include a camera that captures images of the surroundings, a distance sensor that detects the distance traveled, and odometry that integrates wheel rotation to compute the displacement from the initial position. The actuator unit 3 can include wheels and motors for moving in an arbitrary direction.

The sensor unit 1 detects the state of the external world and outputs it to the state acquisition unit 11. The state acquisition unit 11 acquires the detected state and outputs it to the n learners 21 to 2n. Each of the learners 21 to 2n comprises a corrector 31, a value function unit 32, and a controller 33. However, the specific configurations of the value function unit 32 and the controller 33 differ from learner to learner: each learner differs from the others in at least one of its state representation and its learning method. The learners 21 to 2n learn simultaneously based on the acquired state, and each determines an action policy from its learning result and outputs it to the switch 13.

In addition, each of the learners 21 to 2n corrects the parameters it uses for learning by weighting them according to the similarity between the action policy it determined and the action policy output from the switch 13. Preferably, each learner performs this weighting using the importance sampling method described later.

The value function unit 32 calculates, using predetermined parameters and based on the state of the external world supplied by the state acquisition unit 11, a value function for evaluating learning performance, and outputs it to the controller 33 and the stochastic selector 12. The controller 33 determines an action policy based on the state from the state acquisition unit 11 and the value function calculated by the value function unit 32, and outputs it to the switch 13. The corrector 31 reads the currently set parameters from the value function unit 32 and corrects them based on the state from the state acquisition unit 11, the action policy determined by the controller 33, and the action policy of the learner selected by the stochastic selector 12 and the switch 13, thereby updating the parameters of the value function unit 32.

The stochastic selector 12 acquires the value function from the value function unit 32 of each of the learners 21 to 2n and, based on these value functions, controls the switch 13 so as to select the one learner among the learners 21 to 2n that has determined the best action policy. For example, when exactly one learner has the highest learning performance, the stochastic selector 12 controls the switch 13 so that this learner's action policy is selected. When several learners have high learning performance and their learning performances lie within a predetermined range, it controls the switch 13 so that one of their action policies is selected with approximately equal probability.

The switch 13 selects, from among the plurality of action policies, the action policy of the learner indicated by the stochastic selector 12, outputs the selected action policy to each of the learners 21 to 2n, and causes the actuator unit 3 to execute an action following it. The actuator unit 3 executes the action following the selected action policy. This action changes the state of the external world; the sensor unit 1 detects the change, and by repeating the above processing the learners 21 to 2n learn simultaneously.

For example, suppose that a plurality of learners M^i (i = 1, …, n) each learn a control policy π^i for achieving the given task using a value function method or a policy gradient method, and let V^i denote the state value function of learner M^i. For each episode, the parallel learning device 2 selects a learner M^i based on the initial observation x_0 according to the following probability.

Here, T_Sel is a parameter that controls the randomness of the selection probability; the larger it is, the more nearly random the selection of a learner becomes. The action policy of the selected learner is called the behavior policy and is denoted π^Beh. The parallel learning device 2 uses the episodes obtained under π^Beh to evaluate the target policy of each learner M^i.
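The selection-probability equation itself is not reproduced in this text. The following is a minimal sketch that assumes a Boltzmann (softmax) form, in which learner M^i is chosen with probability proportional to exp(V^i(x_0) / T_Sel); this form is an assumption chosen to match the stated role of T_Sel, and the function names are hypothetical.

```python
import math
import random

def selection_probabilities(values, t_sel):
    """Boltzmann (softmax) probabilities over learners (assumed form).

    values: V^i(x_0) for each learner i (hypothetical inputs).
    t_sel:  temperature T_Sel > 0; larger values give more uniform choices.
    """
    if t_sel <= 0:
        raise ValueError("t_sel must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    v_max = max(values)
    weights = [math.exp((v - v_max) / t_sel) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def select_learner(values, t_sel, rng=random):
    """Sample the index of the learner that supplies the behavior policy."""
    probs = selection_probabilities(values, t_sel)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]
```

With a very large `t_sel` the probabilities approach a uniform distribution, and with a very small `t_sel` the learner with the highest value is chosen almost always, which is the behavior the text attributes to T_Sel.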

The configuration of the learners is not limited to the above example and can be modified in various ways; for example, the external memory described below may be added. In this case, partially observable Markov decision processes (POMDPs) can be handled.

FIG. 2 is a block diagram showing another configuration of a learner. The learner 21a shown in FIG. 2 differs from the learners 21 to 2n shown in FIG. 1 in that an external memory 34 is added and the corrector 31, the value function unit 32, and the controller 33 are replaced by a corrector 31a, a value function unit 32a, and a controller 33a. These differences are described in detail below.

The external memory 34 has a storage capacity of l bits, stores the action policy determined by the controller 33a, and outputs the stored action policy to the corrector 31a, the value function unit 32a, and the controller 33a. The state of the external world from the state acquisition unit 11 is also input to the external memory 34, which can record it as well. The value function unit 32a calculates, using predetermined parameters and based on the state from the state acquisition unit 11 and the action policy from the external memory 34, a value function for evaluating learning performance, and outputs it to the controller 33a and the stochastic selector 12. The controller 33a determines an action policy based on the state from the state acquisition unit 11, the action policy from the external memory 34, and the value function calculated by the value function unit 32a, and outputs it to the switch 13. The corrector 31a reads the currently set parameters from the value function unit 32a and corrects them based on the state from the state acquisition unit 11, the action policy from the external memory 34, the action policy determined by the controller 33a, and the action policy of the learner output from the switch 13, thereby updating the parameters of the value function unit 32a.

With this configuration, the learner 21a can use the information m_t in the external memory 34 in addition to the environment state o_t obtained by the sensor unit 1 at time t. The action policy a_t of the controller 33a consists of an action output a^e_t to the environment through the actuator unit 3, which actually causes a state transition, and an action policy a^m_t that manipulates the memory bits.

In this case, the observation x_t used by the learner 21a is expressed as the combination of the environment state o_t and the information m_t in the external memory 34: x_t = (o_t, m_t). Because each bit of the external memory 34 takes the value 1 or 0, m_t can take 2^l values in total. Because the action policy a^m_t comprises, for each bit of the external memory 34, an action that sets the bit to 1 and an action that sets it to 0, there are 2l such actions in total. The action policy a_t of the learner 21a can be formed as the combination of a^e_t and a^m_t, but to limit complexity a^m_t may instead be added as one more primitive action alongside a^e_t.
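The counts given above (2^l memory contents and 2l memory actions) can be checked with a small sketch. The encoding of a memory action as a `(bit index, value)` pair and all function names are assumptions made for illustration; the text does not specify a representation.

```python
from itertools import product

def memory_states(l):
    """All 2**l contents of an l-bit external memory, as tuples of 0/1."""
    return list(product((0, 1), repeat=l))

def memory_actions(l):
    """All 2*l memory actions, encoded as (bit index, value to write)."""
    return [(bit, value) for bit in range(l) for value in (0, 1)]

def apply_memory_action(m, action):
    """Return the memory contents after writing one bit."""
    bit, value = action
    updated = list(m)
    updated[bit] = value
    return tuple(updated)

def observation(o, m):
    """Combine the sensed state o_t with the memory contents m_t: x_t = (o_t, m_t)."""
    return (o, m)
```

For l = 3 this gives 8 memory contents and 6 memory actions, so treating each memory action as one extra primitive action grows the action set linearly in l rather than multiplicatively.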

In this embodiment, the state acquisition unit 11 corresponds to an example of the acquisition means; the learners 21 to 2n and 21a to an example of the learning means; the stochastic selector 12 and the switch 13 to an example of the selection means; the value function units 32 and 32a to an example of the calculation means; the controllers 33 and 33a to an example of the determination means; and the correctors 31 and 31a to an example of the correction means. The external memory 34 corresponds to an example of the storage means.

Next, the parallel learning processing of the learning system configured as above is described. FIG. 3 is a flowchart for explaining the parallel learning processing of the parallel learning device shown in FIG. 1.

First, in step S1, the stochastic selector 12 probabilistically selects one learner based on the learning performance of each of the learners 21 to 2n. Specifically, when exactly one learner has the highest learning performance, the stochastic selector 12 selects it; when several learners have high learning performance and their learning performances lie within a predetermined range, it selects one of them with equal probability.
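The selection rule of step S1 can be sketched as follows. The `tolerance` parameter stands in for the "predetermined range", whose value and exact definition are not given in the text, so it is an assumption, as is the function name.

```python
import random

def select_by_performance(performances, tolerance, rng=random):
    """Pick the best learner, breaking near-ties uniformly at random.

    performances: one learning-performance score per learner (higher is better).
    tolerance:    width of the "predetermined range" (a hypothetical parameter).
    """
    best = max(performances)
    # Learners whose score lies within `tolerance` of the best are candidates.
    candidates = [i for i, p in enumerate(performances) if best - p <= tolerance]
    # A single candidate is returned deterministically; several are equiprobable.
    return rng.choice(candidates)
```

When only one learner is within the tolerance of the best score, the candidate list has a single entry and the choice is deterministic, which covers the first case of the rule.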

After a learner has been selected, in step S2 the state acquisition unit 11 acquires the state of the external world detected by the sensor unit 1 and supplies it to the value function unit 32 of each of the learners 21 to 2n.

Next, in step S3, the value function unit 32 of each of the learners 21 to 2n calculates a value function based on the state from the state acquisition unit 11 and outputs it to the controller 33. The controller 33 determines an action policy based on that state and the calculated value function and outputs it to the switch 13. At this point, the stochastic selector 12 fixes the action policy by controlling the switch 13 so that the action policy of the learner selected in step S1 is output to the actuator unit 3.

Next, in step S4, the switch 13 drives the actuator unit 3 so that it executes an action following the action policy of the learner selected by the stochastic selector 12, and the actuator unit 3 executes that action.

Next, in step S5, the corrector 31 of each of the learners 21 to 2n reads the current parameters from the value function unit 32, corrects them based on the state from the state acquisition unit 11, the action policy determined by the controller 33, and the action policy of the learner selected by the switch 13, updates the parameters of the value function unit 32, and executes the distribution processing based on the importance sampling method.
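Steps S1 to S5 can be put together in a toy sketch. Everything concrete here is an assumption made for illustration: the tabular `Learner` class, the softmax policy, the chain environment in `chain_step`, the greedy choice in S1, and the per-step weighted TD update. The text does not give the update rule the correctors actually use.

```python
import math
import random

class Learner:
    """Tabular learner with a softmax policy over its action values."""

    def __init__(self, n_states, n_actions, alpha, temperature):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.alpha = alpha
        self.temperature = temperature

    def policy(self, s):
        """Action probabilities pi^i(s, .)."""
        top = max(self.q[s])
        w = [math.exp((v - top) / self.temperature) for v in self.q[s]]
        z = sum(w)
        return [x / z for x in w]

    def value(self, s):
        """State value under the learner's own policy."""
        return sum(p * v for p, v in zip(self.policy(s), self.q[s]))

    def update(self, s, a, r, s_next, done, rho, gamma):
        """TD update of the parameters (here a Q table), scaled by the weight rho."""
        target = r if done else r + gamma * self.value(s_next)
        self.q[s][a] += self.alpha * rho * (target - self.q[s][a])


def chain_step(s, a, n_states):
    """Toy environment: action 1 moves right, action 0 moves left; goal at the right end."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done


def run_episode(learners, n_states, gamma, rng, max_steps=50):
    s = 0
    # S1: choose the learner that supplies the behavior policy (greedy on V for brevity).
    values = [lrn.value(s) for lrn in learners]
    beh = values.index(max(values))
    for _ in range(max_steps):
        # S2-S3: every learner sees the same state and proposes a policy.
        policies = [lrn.policy(s) for lrn in learners]
        # S4: only the selected learner's action is executed.
        a = rng.choices(range(len(policies[beh])), weights=policies[beh])[0]
        s_next, r, done = chain_step(s, a, n_states)
        # S5: all learners update, weighted by pi^i / pi^Beh for the executed action.
        for lrn, pi in zip(learners, policies):
            rho = pi[a] / policies[beh][a]
            lrn.update(s, a, r, s_next, done, rho, gamma)
        s = s_next
        if done:
            break
    return beh
```

The point of the sketch is the data flow: one learner acts, every learner updates from the same transition, and the weight is 1 for the acting learner and moves away from 1 as another learner's policy diverges from the behavior policy.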

The distribution processing based on the importance sampling method is now described in detail. The following description takes as an example the case in which the learner 21a with the external memory 34 shown in FIG. 2 is used as each of the learners 21 to 2n.

When the state of the environment at time t is s_t, the parallel learning device 2 receives part of it as o_t through the sensor unit 1. With m_t denoting the information in the external memory 34 at that time, the observation x_t acquired by each of the learners 21 to 2n is x_t = (o_t, m_t). When an action a_t is then output through the actuator unit 3 according to the action policy π, the environment transitions to state s_{t+1} and a scalar reward r_t, the evaluation of that transition, is obtained. The value V^π(s) of state s under the action policy π is given by the following equation.

Here, R(s) is the return observed from the state s, γ is the discount rate (0 ≤ γ ≤ 1), and E_π{·} denotes the expectation when the parallel learning device 2 follows the action policy π. Similarly, the value of executing an action a in the state s under the action policy π is given by the following equation.
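As a reading aid, the state value and the action value described above can be written in standard reinforcement-learning notation as follows. This is a reconstruction under the usual discounted-return convention for an episode that ends at step T, not a copy of the original formulas; the conditioning notation and the symbol T are assumptions.

```latex
\begin{align}
R(s_t) &= \sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k} \\
V^{\pi}(s) &= E_{\pi}\{\, R(s_t) \mid s_t = s \,\} \\
Q^{\pi}(s,a) &= E_{\pi}\{\, R(s_t) \mid s_t = s,\ a_t = a \,\}
\end{align}
```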

V^π above is called the state value function, Q^π is called the action value function, and the two are collectively called value functions. Consider the case where V^π and Q^π are estimated using an action policy π' that differs from the original action policy π. The importance sampling method is used to account for the difference between the target policy π and the behavior policy π'. Let h_m be the m-th episode obtained under the behavior policy π', let T_m be the number of time steps until the episode h_m ends, and let Pr^π(h_m) and Pr^π'(h_m) be the probabilities that the episode h_m occurs under the policies π and π', respectively.

The Monte Carlo estimate required after observing M returns is then given by the following equation.

Here, R_m is the return actually obtained, R_m(s) = r_{t_m(s)} + γ r_{t_m(s)+1} + … + γ^{T_m − t_m(s) − 1} r_{T_m − 1}, and t_m(s) is the time step at which the state s is first visited in the m-th episode h_m. The probability that the episode h_m occurs is given by the following equation.

Here, ρ_t is a coefficient that corrects for the difference between the action policies. Computing Pr^π(h_m)/Pr^π'(h_m) requires no knowledge of the dynamics of the environment; only the ratio of the action policies is needed. It is required that π'(s, a) > 0 whenever π(s, a) > 0.
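The following Python sketch illustrates how such an estimate can be computed from episodes collected under a behavior policy. It is an illustration only: it assumes the weighted (self-normalized) form of importance sampling, takes the policy ratio from the first visit to the state onward, and uses hypothetical names (`discounted_return`, `episode_ratio`, `is_value_estimate`) that do not appear in the original.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Step = Tuple[int, int, float]          # (state, action, reward)
Policy = Callable[[int, int], float]   # pi(s, a) -> probability of a in s


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode tail."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def episode_ratio(steps: Sequence[Step], target: Policy, behavior: Policy) -> float:
    """Product of per-step ratios rho_t = pi(s_t, a_t) / pi'(s_t, a_t).

    The environment dynamics cancel out, so only the two policies are needed.
    Assumes behavior(s, a) > 0 wherever target(s, a) > 0.
    """
    ratio = 1.0
    for s, a, _ in steps:
        ratio *= target(s, a) / behavior(s, a)
    return ratio


def is_value_estimate(episodes: List[List[Step]], state: int,
                      target: Policy, behavior: Policy, gamma: float) -> float:
    """Weighted importance-sampling estimate of V^pi(state) (first visit)."""
    num = 0.0
    den = 0.0
    for ep in episodes:
        first: Optional[int] = next(
            (t for t, (s, _, _) in enumerate(ep) if s == state), None)
        if first is None:
            continue  # the state never occurs in this episode
        tail = ep[first:]
        w = episode_ratio(tail, target, behavior)
        num += w * discounted_return([r for _, _, r in tail], gamma)
        den += w
    return num / den if den > 0.0 else 0.0
```

When the target and behavior policies coincide, every ratio equals 1 and the estimate reduces to the plain average of the observed returns.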

Next, the application of the importance sampling method to the value function method is described for the case where the learners 21 to 2n use the value function method for reinforcement learning. The value function method estimates the value Q^VF defined for each state-action pair by means of the Bellman equation; representative examples are Q-learning and SARSA. SARSA is on-policy reinforcement learning, whereas Q-learning is off-policy reinforcement learning and can have separate behavior and estimation policies.

First, the problem is formulated with the observation treated as the state. When the action a_t is executed at the observation x_t and the reward r_t and the next observation x_{t+1} are received, the TD errors for Q-learning and SARSA are respectively given by the following equations.

Here, Q^Q and Q^SARSA are the action value functions when Q-learning and SARSA are used, respectively.
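For orientation, the textbook forms of the two TD errors can be sketched in Python as below. The lookup-table representation as a dictionary, the default value 0.0 for unseen entries, and the function names are assumptions made for illustration.

```python
from typing import Dict, Hashable, Sequence, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]  # (observation, action) -> value


def td_error_q_learning(q: QTable, x: Hashable, a: Hashable, r: float,
                        x_next: Hashable, actions: Sequence[Hashable],
                        gamma: float) -> float:
    """Off-policy TD error: bootstraps from the best action at x_next."""
    best_next = max(q.get((x_next, b), 0.0) for b in actions)
    return r + gamma * best_next - q.get((x, a), 0.0)


def td_error_sarsa(q: QTable, x: Hashable, a: Hashable, r: float,
                   x_next: Hashable, a_next: Hashable, gamma: float) -> float:
    """On-policy TD error: bootstraps from the action actually taken next."""
    return r + gamma * q.get((x_next, a_next), 0.0) - q.get((x, a), 0.0)
```

The only difference is the bootstrap term: Q-learning uses the maximum over actions at the next observation, while SARSA uses the value of the action that was actually selected.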

A known technique can be used to apply the importance sampling method to the value function method. In the present embodiment, the value function is represented as a lookup table, that is, a weight is assigned to each entry as w_k = Q(x, a), and the action value function with importance sampling is given by the following equation.

Here, exploiting the Markov property of the environment as in SARSA, the update rule is given by the following equations.

Here, t_m is the time at which (x_t, a_t) = (x, a) first holds in the m-th episode, T^VF is the eligibility trace, λ is the decay rate of the eligibility trace, and α_VF is the learning rate. When the behavior policy and the target policy coincide, ρ_t = 1 and the update reduces to the ordinary SARSA update rule.

The stochastic action policy is expressed, for example, using a Boltzmann distribution as in the following equation.

Here, T_VF is a temperature parameter, controlled so that it takes a large value in the early stage of learning and a smaller value as learning progresses. For the value function method, convergence to the optimal policy has been shown when the environment is a Markov decision process (MDP), that is, when x_t = s_t. Even in a POMDP environment, an optimal stochastic policy can be obtained, within the class of policies that have no internal variables, by setting λ appropriately.
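A Boltzmann (softmax) policy with a temperature that decreases during learning can be sketched as follows. The exponential annealing schedule and its constants are hypothetical; the text only states that the temperature starts large and becomes smaller as learning progresses.

```python
import math
from typing import List, Sequence


def boltzmann_policy(values: Sequence[float], temperature: float) -> List[float]:
    """Action probabilities proportional to exp(value / temperature)."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    top = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def annealed_temperature(episode: int, t_start: float = 10.0,
                         t_end: float = 0.1, decay: float = 0.99) -> float:
    """Hypothetical schedule: exponential decay from t_start down to t_end."""
    return max(t_end, t_start * decay ** episode)
```

A high temperature gives nearly uniform exploration, and a low temperature concentrates the probability on the action with the highest value.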

Next, the application of the importance sampling method to the policy gradient method is described for the case where the learners 21 to 2n use the policy gradient method for reinforcement learning. A method that updates parameters in the gradient direction of the expected reward in problems with delayed reward was proposed previously, and this triggered the proposal of various policy gradient methods.

First, the action policy π^PG represented by the parameters w_k is improved by the following equation, using the gradient of the value function V^PG whose expectation is taken over x.

Here, α_PG is a step-size parameter and w is the parameter vector collecting the w_k. When the importance sampling method is used, the state value function is given by the following equation.

Here, Pr(h_m | w) is the probability of obtaining the episode h_m under the action policy parameterized by the vector w, and is expressed by the following equation.

Here, φ(h_m) and Ψ(w, h_m) are given by the following equations.
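One factorization consistent with this description separates the episode probability into an environment-dependent factor and a policy-dependent factor. The sketch below is an assumption, not the original formulas: it treats the state transitions as the environment factor and omits observation probabilities, which would belong to φ in a POMDP.

```latex
\begin{align}
\Pr(h_m \mid w) &= \phi(h_m)\,\Psi(w, h_m) \\
\phi(h_m) &= \Pr(s_0) \prod_{t=0}^{T_m-1} \Pr(s_{t+1} \mid s_t, a_t) \\
\Psi(w, h_m) &= \prod_{t=0}^{T_m-1} \pi^{PG}(x_t, a_t; w)
\end{align}
```

Under such a factorization, φ(h_m) cancels in any ratio of episode probabilities taken under two different parameter vectors, so only the policy factor Ψ remains.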

φ(h_m) above must be sampled from the environment, whereas Ψ(w, h_m) can be computed from the action policy of the parallel learning device 2. Therefore, when one episode is obtained, the direction in which the action policy is improved is obtained by differentiating V(w) with respect to w_k, as in the following equation.

The ratio Pr(h | w)/Pr(h | w') above can be computed as a product of ratios of the control policies, and the update rule when the policy gradient method is used is given by the following equations.

Here, T_t(k) is the eligibility trace for the policy gradient method, and ρ_t = 1 when the behavior policy and the target policy coincide.

Next, the policy gradient method requires a parametric representation of the action policy. A weight is assigned to each state-action pair as w_k = P(x, a), and the policy is expressed with a Boltzmann distribution, as in Equation (13), by the following equation.

Here, P^PG(x_t, a_t) is not an action value. T_PG is a temperature parameter but, unlike in Equation (13), it takes a constant value. The derivative in Equation (23) is then given by the following equation.
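For a Boltzmann policy over tabular preferences, the derivative of the log-probability of the chosen action with respect to each preference has a simple closed form: (1 if b is the chosen action, else 0, minus π(x, b)) divided by the temperature. The Python sketch below implements this standard softmax result; whether it matches the original equation term by term is an assumption, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(prefs: Sequence[float], temperature: float) -> List[float]:
    """Boltzmann distribution over the preferences at one observation."""
    top = max(prefs)
    exps = [math.exp((p - top) / temperature) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_policy(prefs: Sequence[float], action: int,
                    temperature: float) -> List[float]:
    """d ln pi(x, action) / d P(x, b) for every action index b."""
    probs = softmax(prefs, temperature)
    return [((1.0 if b == action else 0.0) - p) / temperature
            for b, p in enumerate(probs)]
```

The components sum to zero, so raising the preference of the chosen action lowers the relative preference of all other actions at the same observation.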

The policy gradient method above does not estimate a value function explicitly and updates the policy online. In the present invention, however, a value function is needed to select a learner at the beginning of each episode, so the value V^PG is updated for every episode according to Equation (4).

Referring again to FIG. 3, after the distribution processing based on the importance sampling method has been executed, in step S6 each of the learners 21 to 2n determines whether the task currently being executed has ended. If the task has not ended, the processing from step S2 onward is repeated; if it has ended, the processing proceeds to step S7.

When the task has ended, in step S7 the stochastic selector 12 determines whether learning for the given task has been completed, that is, whether the learning performance required for the given task has been achieved. If learning has not been completed, the processing from step S1 onward is repeated; if learning has been completed, the processing ends.

Through the above processing, in the present embodiment, the state acquisition unit 11 acquires the state of the outside world, the learners 21 to 2n learn simultaneously based on the acquired state and each determine an action policy from the learning result, the stochastic selector 12 and the switch 13 select one of the determined action policies based on the learning performance of each of the learners 21 to 2n, the actuator unit 3 executes the action in accordance with the selected action policy, and this processing is repeated. As a result, the learners that were not selected also learn from the experience obtained under the action policy determined by the selected learner, so the plurality of learners 21 to 2n simultaneously learn action policies for achieving the given task. The learners 21 to 2n can therefore be trained efficiently, and the learning time required for them to acquire a structure suited to the task is greatly reduced.
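The overall loop summarized above can be illustrated with a deliberately small Python sketch. Everything concrete in it is an assumption made for illustration: the one-step two-action task, the `TabularLearner` class, the use of the expected value under a learner's own policy as its learning performance, the tie tolerance, and the clipping of the effective step size to 1 for numerical stability. The learners also use a simplified one-step update without eligibility traces, which is not the update rule of the embodiment.

```python
import math
import random
from typing import List


class TabularLearner:
    """Hypothetical single-observation learner with a Boltzmann policy."""

    def __init__(self, n_actions: int, temperature: float, alpha: float) -> None:
        self.q = [0.0] * n_actions
        self.temperature = temperature
        self.alpha = alpha

    def policy(self) -> List[float]:
        top = max(self.q)
        exps = [math.exp((v - top) / self.temperature) for v in self.q]
        total = sum(exps)
        return [e / total for e in exps]

    def performance(self) -> float:
        # Expected value under the learner's own policy (stand-in metric).
        return sum(p * v for p, v in zip(self.policy(), self.q))

    def update(self, action: int, reward: float, rho: float) -> None:
        # One-step update weighted by the policy ratio rho; step clipped to 1.
        step = min(1.0, self.alpha * rho)
        self.q[action] += step * (reward - self.q[action])


def select_learner(learners: List[TabularLearner], tolerance: float,
                   rng: random.Random) -> int:
    """Pick the best learner; near-ties are broken with equal probability."""
    scores = [learner.performance() for learner in learners]
    best = max(scores)
    candidates = [i for i, s in enumerate(scores) if best - s <= tolerance]
    return rng.choice(candidates)


def run(n_episodes: int, seed: int = 0) -> List[TabularLearner]:
    rng = random.Random(seed)
    learners = [TabularLearner(2, 1.0, 0.1), TabularLearner(2, 0.2, 0.1)]
    for _ in range(n_episodes):
        k = select_learner(learners, 0.05, rng)              # choose a learner
        policies = [learner.policy() for learner in learners]
        action = rng.choices([0, 1], weights=policies[k])[0]  # act with its policy
        reward = 1.0 if action == 1 else 0.0                  # toy task
        for learner, pi in zip(learners, policies):           # all learners update
            rho = pi[action] / policies[k][action]
            learner.update(action, reward, rho)
    return learners
```

For the selected learner the ratio is exactly 1, so it performs an ordinary update, while the other learners reuse the same experience with a weight that reflects how likely their own policy was to choose that action.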

Next, the learning effect of the parallel learning device is described with specific examples. FIG. 4 is a characteristic diagram showing the learning performance when the parallel learning device shown in FIG. 1 is used to control an inverted pendulum. In the example of FIG. 4, a cart is moved so that a pole mounted on it stays upright; this is a POMDP case in which only part of the state variables, namely the cart position x and the pole angle θ, can be observed. The vertical axis of FIG. 4 shows the total reward per episode, which represents the learning performance, and the horizontal axis shows the number of episodes.

Curve A in FIG. 4 represents the learning performance of the parallel learning device shown in FIG. 1 when four learners are trained simultaneously using the importance sampling method. The learners 21 to 2n used are a learner that uses the value function method and has no external memory 34, a learner that uses the value function method and has the external memory 34, a learner that uses the policy gradient method and has no external memory 34, and a learner that uses the policy gradient method and has the external memory 34.

Curves B to F are comparative examples. Curve B represents the learning performance when only the learner that uses the value function method and has no external memory 34 is used. Curve C represents the learning performance when only the learner that uses the value function method and has the external memory 34 is used. Curve D represents the learning performance when only the learner that uses the policy gradient method and has no external memory 34 is used. Curve E represents the learning performance when only the learner that uses the policy gradient method and has the external memory 34 is used. Curve F represents the learning performance when the four learners are trained simultaneously without using the importance sampling method.

FIG. 4 shows that, even when the environment is a POMDP, the parallel learning device shown in FIG. 1 (curve A) achieves the highest learning efficiency, the shortest learning time, and the highest attainable learning performance among the learning methods compared (curves B to F).

FIG. 5 is a characteristic diagram showing the learning performance when the parallel learning device shown in FIG. 1 is used for the travel control of an autonomous mobile robot. In the example of FIG. 5, the autonomous mobile robot reaches a target position while avoiding obstacles. The vertical axis of FIG. 5 shows the average reward, which represents the learning performance, and the horizontal axis shows the number of episodes.

Curve A in FIG. 5 represents the learning performance of the parallel learning device shown in FIG. 1 when four learners are trained simultaneously using the importance sampling method. The learners 21 to 2n used are a learner that performs coarse movement control using the value function method, a learner that performs fine movement control using the value function method, a learner that performs coarse movement control using the policy gradient method, and a learner that performs fine movement control using the policy gradient method.

Curves B and C are comparative examples. Curve B represents the learning performance when only the learner that performs coarse movement control using the value function method is used, and curve C represents the learning performance when only the learner that performs fine movement control using the value function method is used.

FIG. 5 shows that, when the parallel learning device shown in FIG. 1 is used for the autonomous mobile robot (curve A), the learning efficiency rises sharply as the number of episodes increases, the learning time is the shortest, and the attainable learning performance is the highest, compared with the other learning methods (curves B and C).

The above embodiment has been described for an autonomous mobile robot and the like, but the application of the present invention is not limited to these examples, and the invention can be applied in various ways. For example, the parallel learning device of the present invention may be applied to a pet robot or the like, with human teaching introduced as one of the plurality of learners. In this case, the pet robot can act as taught by a human while its own learning proceeds at the same time; for example, the owner can teach the pet robot tricks while the robot acquires more intelligent behavior through autonomous learning.

The parallel learning device of the present invention may also be applied to fields such as optimal control to combine conventional control with machine learning, by using a controller that has been used in a factory or the like, for example for manipulator control, as the controller of a learner. In this case, the existing controller can be used as it is, so the same performance as before is guaranteed while any better performance acquired by the other learners is used automatically.

Furthermore, the parallel learning device of the present invention may be applied to the part of evolutionary robotics and similar fields that evaluates a large number of learners. In this field, a plurality of controllers have been evaluated one at a time in sequence, which takes an enormous amount of time. With the parallel learning device of the present invention, a plurality of learners can be evaluated in parallel, so the evaluation time can be greatly reduced.

FIG. 1 is a block diagram showing the configuration of a learning system using a parallel learning device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing another configuration of a learner.
FIG. 3 is a flowchart for explaining the parallel learning processing of the parallel learning device shown in FIG. 1.
FIG. 4 is a characteristic diagram showing the learning performance when the parallel learning device shown in FIG. 1 is used to control an inverted pendulum.
FIG. 5 is a characteristic diagram showing the learning performance when the parallel learning device shown in FIG. 1 is used for the travel control of an autonomous mobile robot.

Explanation of symbols

1 Sensor unit
2 Parallel learning device
3 Actuator unit
11 State acquisition unit
12 Stochastic selector
13 Switch
21 to 2n, 21a Learner
31, 31a Corrector
32, 32a Value function unit
33, 33a Controller
34 External memory

Claims

A parallel learning device that learns an action policy for achieving a given task, comprising:
acquisition means for acquiring the state of the outside world;
a plurality of learning means for learning based on the state of the outside world acquired by the acquisition means and determining an action policy from the learning result; and
selection means for selecting one action policy, based on the learning performance of each learning means, from the plurality of action policies determined by the plurality of learning means,
wherein the following processing is repeated: the selection means selects one learning means from the plurality of learning means based on the learning performance of each learning means; the acquisition means acquires the state of the outside world; each of the plurality of learning means learns simultaneously with the other learning means based on the state of the outside world acquired by the acquisition means and determines an action policy from the learning result; the selection means outputs the action policy determined by the selected learning means; and each of the plurality of learning means corrects the parameters used for learning by weighting them, using an importance sampling method, according to the similarity between the action policy determined by that learning means and the action policy of the one learning means selected by the selection means.

The parallel learning device according to claim 1, wherein the selection means selects the learning means with the highest learning performance when there is exactly one such learning means among the plurality of learning means, and, when there are a plurality of learning means with high learning performance whose learning performances fall within a predetermined range, selects one learning means from among them with equal probability.

The parallel learning device according to claim 1 or 2, wherein each of the plurality of learning means includes:
calculation means for calculating, using predetermined parameters, a value function for evaluating learning performance based on the state of the outside world acquired by the acquisition means;
determination means for determining an action policy based on the state of the outside world acquired by the acquisition means and the value function calculated by the calculation means; and
correction means for correcting the parameters of the calculation means based on the state of the outside world acquired by the acquisition means, the action policy determined by the determination means, and the action policy selected by the selection means.

The parallel learning device according to claim 3, wherein at least one learning means of the plurality of learning means further includes storage means for storing the action policy determined by the determination means.

A parallel learning method for learning an action policy for achieving a given task using a parallel learning device comprising acquisition means, a plurality of learning means, and selection means, the method repeating:
a selection step in which the selection means selects one learning means from the plurality of learning means based on the learning performance of each learning means;
an acquisition step in which the acquisition means acquires the state of the outside world;
a learning step in which each of the plurality of learning means learns simultaneously with the other learning means based on the state of the outside world acquired in the acquisition step and determines an action policy from the learning result;
a step in which the selection means outputs the action policy determined by the learning means selected in the selection step; and
a step in which each of the plurality of learning means corrects the parameters used for learning by weighting them, using an importance sampling method, according to the similarity between the action policy determined by that learning means and the action policy of the one learning means selected in the selection step.

A parallel learning program for learning an action policy for achieving a given task, the program causing a computer to function as:
acquisition means for acquiring the state of the outside world;
a plurality of learning means for learning based on the state of the outside world acquired by the acquisition means and determining an action policy from the learning result; and
selection means for selecting one action policy, based on the learning performance of each learning means, from the plurality of action policies determined by the plurality of learning means,
wherein the following processing is repeated: the selection means selects one learning means from the plurality of learning means based on the learning performance of each learning means; the acquisition means acquires the state of the outside world; each of the plurality of learning means learns simultaneously with the other learning means based on the state of the outside world acquired by the acquisition means and determines an action policy from the learning result; the selection means outputs the action policy determined by the selected learning means; and each of the plurality of learning means corrects the parameters used for learning by weighting them, using an importance sampling method, according to the similarity between the action policy determined by that learning means and the action policy of the one learning means selected by the selection means.