JP2004178492A

JP2004178492A - Plant simulation method using enhanced learning method

Info

Publication number: JP2004178492A
Application number: JP2002346993A
Authority: JP
Inventors: Toshihiro Yamashita; 利博山下; Shigeaki Nakamura; 成章中村; Masataka Abe; 正孝安部; Yoshinori Terasawa; 良則寺澤
Original assignee: Mitsubishi Heavy Industries Ltd
Current assignee: Mitsubishi Heavy Industries Ltd
Priority date: 2002-11-29
Filing date: 2002-11-29
Publication date: 2004-06-24

Abstract

PROBLEM TO BE SOLVED: To address the problem that a refuse incineration plant or the like exhibits complex behavior, behaves differently from one incinerator to another even within the same plant, and changes its behavior as the plant ages over long-term operation.

SOLUTION: A plant simulation method using reinforcement learning comprises the steps of: setting a value function to its initial state (S1); taking in previously prepared actual plant operation data (S2); obtaining a state quantity by running a model calculation with a previously prepared process model for the manipulated variable in that data (S3); calculating a reward from the manipulated variable, the calculated state quantity, and the actual plant operation data (S4); and learning, by reinforcement learning based on the rewards calculated for a plurality of parameters, a policy that maximizes the return, i.e. the sum of the rewards (S5). The value function gives the return expected in the future from an action taken in a given state, and the simulation is carried out with the learned parameters obtained by using the value function.

COPYRIGHT: (C)2004, JPO

Description

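[0001]
[Technical field of the invention]
The present invention relates to a plant simulator, a method of operating it, and a program therefor. In particular, it relates to a simulation method that simulates (mimics) the operation of a plant whose operation demands operator skill, such as a refuse incineration plant, or of a plant whose state changes over long-term operation.
[0002]
[Prior art]
In a refuse incinerator, the refuse (waste) being burned is made up of constituents in widely varying proportions depending on its nature. This variation in composition is especially pronounced for general waste such as household refuse. As a result, the incinerator itself behaves in a complex way. Moreover, because the behavior is affected by the variation in refuse composition and by the habits of the operators, incinerators often have their own quirks, differing from plant to plant and even from one incinerator to another within the same plant. Operating such a refuse incinerator reliably therefore takes considerable skill. Some plants other than refuse incinerators likewise require skill to operate.
[0003]
That is, in such a plant the operator must run the plant by manipulating many control inputs while reading the relationships among many controlled state quantities, which requires a high degree of skill.
[0004]
A training simulator could be used for operation training on such a plant, but building a practical training simulator requires a plant model as the basis for simulating the plant's complex behavior. For example, Patent Document 1 describes a method of generating a plant simulation model using transfer functions and PID control. Patent Document 2 describes a method of matching a plant simulator to the behavior of the actual plant by using the error between them.
[0005]
Furthermore, in some such plants the process state quantities (temperature, pressure, and so on) are difficult to measure, or the complex processes actually taking place cannot be fully grasped. Beyond the operating skill mentioned above, it is then hard to know what kind of maintenance the plant needs and when, or how much time remains before the end of its service life. Because of these uncertainties, operating periods had to be set, and maintenance carried out, with large safety margins.
[0006]
In the field of learning algorithms, a technique called reinforcement learning is known. Reinforcement learning is a form of unsupervised learning in which an autonomous agent, the learning subject, determines a policy in a given environment using the rewards and penalties obtained from that environment as cues, so as to maximize the value, i.e. the expected return (expected value of the reward) that the policy yields. Its characteristic feature is that the agent can learn even when the environment is complex and uncertain (see, for example, Non-Patent Document 1). As an example of its use, Patent Document 3 discloses a method of optimizing the route of a dredger, but no example of applying it to a process model of a plant or the like has been disclosed.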
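[0007]
[Patent Document 1]
JP-A-7-64610
[Patent Document 2]
JP-A-10-207507
[Patent Document 3]
JP-A-10-253602
[Non-patent document 1]
Institute of Electrical Engineers of Japan, Investigating Committee on Learning Methods Using GA and Neural Networks and Their Applications (ed.), "Learning and Its Algorithms", Morikita Publishing, August 28, 2002, pp. 155-164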
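[0008]
[Problems to be solved by the invention]
An actual refuse incineration plant not only behaves in a complex way depending on the nature of the refuse; different plants, even plants built to the same design, and even different incinerators within one plant, behave differently (that is, the quirks of the plant and of the incinerator are compounded by the habits of the operators), and the behavior also changes as the plant ages. When behavior is this complex, a simple model cannot represent it adequately. Nor has the behavior of the actual plant or incinerator, as affected by aging and the like at the time of the simulation, been simulated. To reflect that state of the actual equipment in the process model, and to reduce the plant's uncertainties, the operation of the plant must be analyzed successively.
[0009]
[Means for solving the problems]
To solve this problem, the present invention uses a reinforcement learning algorithm in an apparatus or the like that simulates the operation of a plant. In the present invention, a space corresponding to the manipulated variables and state quantities of the plant serves as the environment for reinforcement learning.
[0010]
That is, the present invention provides a method of simulating plant operation comprising the steps of: (a) setting a value function to its initial state; (b) then, using previously prepared actual plant operation data, running a model calculation with a previously created process model for a given manipulated variable to obtain a state quantity; (c) calculating a reward using the manipulated variable, the calculated state quantity, and the actual plant operation data; (d) repeating steps (b) and (c) for each of a plurality of parameters in a parameter space that defines the relationship between the manipulated variable and the state quantity in the actual plant operation data; (e) learning, using the value function, a policy that maximizes the return, i.e. the sum of the rewards, by performing reinforcement learning based on the rewards calculated for the plurality of parameters; and (f) performing a simulation based on the learned parameters obtained with the resulting value function.
[0011]
Here, the value function is a function used as the index of learning, and is a kind of evaluation function used in reinforcement learning. Actual plant operation data are data containing the manipulated variables and state quantities recorded while the actual plant is running. A manipulated variable is any quantity that is adjusted or changed in operating the plant; in a refuse incinerator, as detailed later, the refuse feed rate is one example. The model calculation computes state quantities from manipulated variables, and is based on a physical theoretical model equation, an empirical equation obtained by numerical fitting, a theoretical equation based on a working hypothesis, or the like; its output can be adjusted through some set of parameters. A state quantity is a numerical value monitored while the plant is running, for example the temperature at a certain position in the furnace. When the entity that learns inside the computer needs to be considered concretely, we speak of an "agent". This is an agent in the sense generally used in reinforcement learning, realized inside the computer as the learning subject, but it later also acts as the subject that carries out the simulation. It is not a function of the hardware alone (the arithmetic means and storage means); it is a function realized in a virtual space held at least partly in the computer's main memory, and operates through the cooperation of software and hardware. A reward is a quantity that is added when the value function used as the index of learning is rewritten as the state is updated; it is assigned to the state and expresses the incentive given to the learning agent. The parameter space is the mathematical space of values that the parameters can take. These parameters are the parameters of an equation that approximates the measured data. In general, the invention of the present application has the agent function learn with this parameter space, which corresponds to the measured data, as the "environment" of reinforcement learning. The computer has at least arithmetic means, storage means, input means, and output means. That is, the present invention executes steps (a) to (f) above on a computer provided with arithmetic means, storage means, input means, and output means.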
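[0012]
With this simulation method, the computer (agent) can be made to behave as observed in the manipulated variables and state quantities of the plant, so that the actual plant can be simulated on a computer.
[0013]
The present invention can also be a method of simulating plant operation that further comprises the steps of: (g) accepting a virtual manipulated variable; (h) running the model calculation for the virtual manipulated variable on the basis of the learned parameters; and (i) outputting the state quantity obtained by the model calculation.
[0014]
A virtual manipulated variable is a value used as input in the simulation, for example a value that a user enters as if it were an input to the actual plant, or a value that another computer sends as a manipulated variable, treating the computer running this method as the plant. A user enters it from an input device such as a keyboard or mouse. The model calculation reflects the learning results at that point in time and, through reinforcement learning, simulates the actual plant. The resulting state quantity is output by this simulation method in place of actually running the plant. The output goes to a display device for the user or to another computer. This allows the simulator of the present invention to be used for training operators in running the plant.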
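Steps (g) to (i) can be sketched as a small stateful object. This is only an illustration: the class name `TrainingSimulator`, the choice of a first-order lag as the process model (borrowed from the embodiment described later), the explicit Euler step, and the parameter values are all assumptions, not part of the patent text.

```python
class TrainingSimulator:
    """Hypothetical simulator: first-order lag with learned parameters T and G."""

    def __init__(self, T, G, dt=1.0, y0=0.0):
        self.T, self.G, self.dt = T, G, dt
        self.y = y0  # current simulated state quantity (e.g. furnace temperature)

    def step(self, virtual_u):
        # (g) virtual_u is the virtual manipulated variable entered by the trainee
        # (h) model calculation: one explicit Euler step of dy/dt = (G * u - y) / T
        self.y += self.dt * (self.G * virtual_u - self.y) / self.T
        # (i) the simulated state quantity is returned for display
        return self.y


sim = TrainingSimulator(T=10.0, G=2.0)
readings = [sim.step(1.0) for _ in range(5)]
```

Holding the input constant makes the simulated state quantity approach G times the input, which is the behavior a trainee would watch on the console.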
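[0015]
In the present invention, the previously prepared actual plant operation data can be time-series data, and the method can further comprise the step of (j) comparing the values of the parameters given by the learned policy at different times and outputting the result of the comparison.
[0016]
Time-series data are data on the manipulated variables and state quantities at a plurality of times. Because the learned policy gives the optimal parameters at that point in time, optimal values of the plant model parameters are obtained for different times. Comparing them, that is, comparing the plant model parameters over time, makes it possible to simulate how the state of the plant changes over time. Moreover, the optimal parameter values at different times reflect the actual phenomena of the plant. The state of the plant can thus be analyzed by way of the model parameters.
[0017]
In the present invention, the value function can be an action-value function, which takes as the value the return that can be expected in the future from an action in a given state and expresses the value of taking action a in state s as a function of s and a; the reinforcement learning can then be performed by the Q-Learning method. If the value function that serves as the evaluation function in reinforcement learning is the action-value function Q(s, a) based on state s and action a, the Q-Learning method can be used.
[0018]
In the present application, the virtual manipulated variable input means is, for example, input means provided on a suitable computer terminal, which accepts input from an operator undergoing training or the like. The simulated state quantity output means is suitable data output means or display means, for example display means that presents the output to the operator undergoing training, who is using the virtual manipulated variable input means, as if it were the operating state of the plant.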
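[0019]
[Embodiments of the invention]
Embodiments of the present invention are described below with reference to the drawings.
[0020]
[Embodiment 1]
This embodiment describes a simulator that simulates the operation of a plant.
[0021]
(Outline of the actual plant)
FIG. 1 illustrates the situation in which the simulator of the present invention is used. The refuse incinerator 1 is an example of an actual plant whose operation requires considerable training. Such a plant is run by an operator 21 who performs various operations. Normally, what the operator 21 operates is the operator console 22, which is connected to the plant operation device 2, which in turn is connected to the refuse incinerator 1. The plant operation device 2 presents the various state quantities needed to run the refuse incinerator 1 to the operator through the operator console 22; the operator grasps the state of the refuse incinerator 1 and, according to that state, changes the settings of the manipulated variables to run the incinerator appropriately.
[0022]
The monitoring center 4 is connected to the plant operation device 2 via a network 3 such as a dedicated line and monitors the state of the refuse incinerator 1; it is set up to provide services that support the operation management of the plant. The monitoring center 4 can therefore collect the various manipulated variables and state quantities of the refuse incinerator 1. To support operation management, it has functions such as risk prediction 5, operation diagnosis 6, abnormality and failure diagnosis 7, remaining life prediction 8, and operation training 9.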
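[0023]
(Overview of simulator)
FIG. 2 illustrates the configuration of the simulator for the case of an operation training device that performs operation training 9. The operator console 22 is connected through the network 3 to a computer that serves as the process simulator 92. The process simulator 92 is located, for example, at the monitoring center 4, but may be anywhere as long as it can be reached through the network 3. The process simulator 92 is provided with an actual plant operation data file 94, which may be at any location accessible from the process simulator 92. The actual plant operation data file 94 stores, in time series, the measured manipulated variables and state quantities of the refuse incinerator 1 received through the network 3.
[0024]
(Contents of actual measurement data)
In an actual refuse incinerator, the measured data on manipulated variables and state quantities cover many different values. Examples of manipulated variables, which are quantities set by human action, are the feeder speed (which determines the refuse feed rate), the damper opening of the blower fan, the primary air temperature, the primary air pressure, and the exhaust damper opening. State quantities are quantities that represent the state of the plant, for example the furnace temperature, furnace pressure, exhaust gas temperature, exhaust gas flow rate, exhaust gas pressure, and exhaust gas composition (oxygen content, carbon monoxide content, and so on). Quantities representing the ambient conditions determined by the weather (for example, air temperature and humidity) can also be treated as state quantities or manipulated variables. Because they neither represent the state of the plant directly nor are actively manipulated, they are deliberately left out here, but they can be added to the simulation. An actual refuse incinerator runs with these values changing from moment to moment, and the measured data are stored in the actual plant operation data file 94.
[0025]
In an actual refuse incinerator, the manipulated variables and state quantities are functionally related, but a state quantity does not depend only on the manipulated variables at that instant. For example, the state quantity at one time acts as the initial value for what follows and so influences the state quantities at later times. A manipulated variable at one time also influences later state quantities with a certain delay, like the input to a first-order lag element. Furthermore, the relationship between manipulated variables and state quantities is not necessarily deterministic. One reason is that the thing being manipulated does not correspond exactly to the manipulated variable but varies within a certain range (for example, even at a constant feeder speed, the amount of refuse actually fed is not always the same). Another is that state quantities are also affected by weather conditions (air temperature and humidity) and by the nature of the refuse (its type, composition, moisture content, and so on). A further reason is the influence of phenomena that, from an engineering standpoint, can only be treated as stochastic (for example, the combustion process).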
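[0026]
(Agent operation)
In FIG. 2, the agent 96 has a reinforcement learning function 962 and changes its own state in response to the environment in accordance with reinforcement learning. The agent 96 is a virtual entity that exists on the computer, and its state is changed through some set of parameters. The agent 96 performs reinforcement learning on the basis of the actual plant operation data. In this embodiment, the state of the agent 96 is defined by a model containing mathematical expressions that can represent the operation of the plant as a numerical input-output relationship. The invention as a whole also covers models other than those containing mathematical expressions, such as purely numerical models (for example, a simple matrix giving the correspondence between a vector of manipulated variable data and a vector of state quantity data). In this embodiment, the mathematical model is called the process model 964. Because the process model 964 contains adjustable parameters, it can be tuned to the actual plant. The process model 964 is changed by changing the numerical values (parameters) in the plant model data file 98 that characterize the plant model. This tuning amounts to changing the state of the agent 96 itself, and the parameter space in which the process model 964 is tuned is, in this embodiment, the environment in which the agent 96 learns by reinforcement learning.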
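[0027]
(How to build a simulator)
FIG. 3 illustrates the method of building the simulator of this embodiment using reinforcement learning. The simulator is built on a computer by having the agent 96 perform reinforcement learning.
[0028]
First, at the start of learning, the process model is created and initialization is carried out (step S1). The process model can be given by model equations that take the various physical phenomena into account and capture their essential features. FIG. 3 shows the transfer function of a first-order lag element, but any model created from some theoretical consideration or working hypothesis can be used, such as an approximation by an autoregressive model or a model that represents the effect of turbulence in combustion with a probability density function; the model may also be expressed as a combination of several physical phenomena. If the model is too simple to reproduce the complexity actually occurring in the plant, a suitable stochastic term can be added to reproduce the unavoidable fluctuations seen in the real plant, but such a term need not be considered in the learning stage. This embodiment uses a model that can express the relationship between manipulated variables and state quantities; for the invention as a whole, however, even something that cannot be modeled is acceptable as long as it can be described as a numerically expressible input-output relationship.
[0029]
Initialization means setting the agent 96 to its initial state and setting the action-value function Q, used later, to its initial state. The state of the agent 96 is determined by the parameters that govern its operation. For example, if the first-order lag element of FIG. 3 is used as the model, with the refuse throughput as the manipulated variable and the furnace temperature as the state quantity, the state of the agent 96 is determined by the pair of values of the time constant T and the gain G. The range of each parameter to be considered for the model, and the step size of its values, are also fixed at this stage.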
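A first-order lag process model and its discretized parameter space might be set up as follows. This is a minimal sketch: the function name, the forward Euler discretization, the sampling interval `dt`, and the grid ranges and step sizes are illustrative assumptions, since the patent gives no numerical values.

```python
import numpy as np


def simulate_first_order_lag(u, T, G, dt=1.0, y0=0.0):
    """Response of the first-order lag G / (1 + T s) to the input sequence u."""
    y = np.empty(len(u), dtype=float)
    prev = y0
    for k, u_k in enumerate(u):
        # dy/dt = (G * u - y) / T, integrated with an explicit Euler step
        prev = prev + dt * (G * u_k - prev) / T
        y[k] = prev
    return y


# Hypothetical parameter ranges and step sizes, fixed at initialization.
T_GRID = np.arange(5.0, 55.0, 5.0)  # candidate time constants: 5, 10, ..., 50
G_GRID = np.arange(0.5, 3.5, 0.5)   # candidate gains: 0.5, 1.0, ..., 3.0

# One agent state is a pair of indices (i, j) selecting (T_GRID[i], G_GRID[j]).
step_response = simulate_first_order_lag(np.ones(200), T=10.0, G=2.0)
```

The explicit Euler step is stable here only because `dt` is small relative to every time constant on the grid; a different discretization would be needed otherwise.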
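[0030]
Next, data for training the agent 96 are sampled as appropriate from the actual plant operation data file 94 (step S2). Sampling is used because data of sufficient accuracy to serve as the environment for reinforcement learning are all that is needed.
[0031]
Next, the model calculation is run with the process model as it stands at that point (step S3). Since a measured state quantity is normally available for each sampled manipulated variable of the plant, the state quantity is computed from the current process model for the same manipulated variable as in the measurement.
[0032]
Next, the reward is calculated (step S4). For each pair of manipulated variable and state quantity obtained from the actual plant, consider the pair formed by the same manipulated variable and the state quantity calculated above. Even for the same manipulated variable, the state quantity of the actual plant and the state quantity output by the agent 96 generally differ. This difference arises not only because the model is imperfect (too simple, or with parameters not yet optimized) but also from the limited accuracy of the manipulated variables in the actual plant, stochastic or fluctuating elements in the plant's behavior, errors in measuring the state quantities, and so on. The reward can be defined, for example, according to the difference (residual) between the measured and calculated state quantities. For example, taking the absolute value of the residual for every sampled measurement, a reward element r can be defined for each sample by Equation 1.
Equation 1
r = C - |calculated state quantity - measured state quantity|
(C: a positive constant)
The reward data needed for the subsequent learning are rewards over a range of parameters in the parameter space (in this example, the range of values that the time constant T and gain G can take), so a reward must be obtained for each set of parameter values within that range. For each set of parameter values, for example, the reward elements r belonging to that set are summed and then divided by the number of data points to normalize the result, which prevents parameter sets with more data from receiving an apparently larger reward. The reward can be defined in other ways as appropriate; any suitable number that tells the agent 96 how far the calculation is from reality will do.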
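The reward of Equation 1, summed over the samples and normalized by their number, can be written in a few lines. The function name and the value of C are illustrative assumptions.

```python
import numpy as np


def normalized_reward(y_calc, y_meas, C=10.0):
    """Mean over all samples of r = C - |calculated - measured| (Equation 1)."""
    r = C - np.abs(np.asarray(y_calc, dtype=float) - np.asarray(y_meas, dtype=float))
    # Dividing the sum by the number of samples keeps parameter sets with
    # more data from looking better than they are.
    return float(r.sum() / r.size)


example = normalized_reward([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # residuals 0, 0, 2
```

A parameter set whose model output matches the measurements exactly receives the reward C, and the reward falls as the mean absolute residual grows.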
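[0033]
Then, Q-Learning, one type of reinforcement learning, is performed on the basis of the rewards calculated for each parameter in the parameter space (step S5). Q-Learning is adopted because not only the agent's state but also the actions the agent takes are subject to learning. Because actions are learned too, the method tracks well even actual plant data that change quickly relative to the optimization calculation, and the simulator can be built with little time lag. For this reason this embodiment uses Q-Learning, but other reinforcement learning methods such as TD learning may be used in the present invention. With TD learning, for example, the state at the next time is determined from the agent's state at the current time, so actions are not evaluated, and when the actual plant data change quickly the time lag can cause errors; on the other hand, when the number of actions is large (for example, when there are many parameters, as with a model equation built from a high-order polynomial), the amount of computation is reduced and more learning iterations can be run, so the error may actually become smaller. The Actor-Critic method, another example of reinforcement learning, likewise reduces the amount of computation, as TD learning does, and has the further advantage of allowing stochastic action selection.
[0034]
In Q-Learning, an action-value function Q(s, a) is considered for a state s and an action a defined for each s. The action-value function Q(s, a) is the evaluation value (value) that the Q-Learning method obtains from state s and action a. In this example, the state s is the state that the agent 96 occupies at that moment in the two-dimensional space of the time constant T and the gain G. The initial value of Q(s, a) is set to 0, for example, but in general any value can be used (step S0).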
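[0035]
(Q-Learning)
When Q-Learning starts (step S50), an action a is first chosen stochastically in state s according to the policy π at that time (step S52). This determines the next state s'. The policy π is a function of state s and action a; when π allows more than one action, one of them is selected at random using suitable random numbers. The action-value function Q(s, a) for action a from state s is recalculated using the largest Q value among the actions a' that can be taken in the subsequent state s'.
[0036]
Next, Q is updated according to Equation 2 (step S54). This uses the reward r in state s, the discount rate γ (a value of at least 0 and less than 1, set beforehand in step S0), and the learning rate α (a value greater than 0 and at most 1, set beforehand in step S0).
Equation 2
Q(s, a) ← (1 - α) Q(s, a) + α [r + γ max Q(s', a')]
[0037]
As an example of an action a, if the agent can move from the current point in the (time constant T, gain G) plane in eight directions (up, down, left, right, and the four diagonals), an action is a move to a new state in one of those eight directions. The maximum (max) is taken over the actions a' available in state s'.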
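The update of Equation 2 on a (T, G) grid can be sketched as below. The array layout `Q[i, j, a]`, the names `ACTIONS` and `q_update`, and the default values of α and γ are assumptions made for illustration; the ninth action ("stay") is included because the description later counts not changing the state as an action.

```python
import numpy as np

# Moves on the (T, G) index grid: the eight directions plus "stay" (index 4).
ACTIONS = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]


def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply Equation 2 in place to Q[s, a] and return the new value."""
    i, j = s
    ni, nj = s_next
    # r + gamma * max over a' of Q(s', a')
    target = r + gamma * Q[ni, nj].max()
    Q[i, j, a] = (1.0 - alpha) * Q[i, j, a] + alpha * target
    return Q[i, j, a]


Q = np.zeros((3, 3, len(ACTIONS)))
Q[1, 1, 8] = 2.0  # pretend one action in the next state already has value 2
new_value = q_update(Q, s=(0, 0), a=4, r=1.0, s_next=(1, 1), alpha=0.5, gamma=0.5)
```

With these numbers the target is 1 + 0.5 * 2 = 2, so the entry moves halfway from 0 to 2.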
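[0038]
When the transition according to action a results in the new state s' (step S56), the action-value function Q for action a in state s is reinforced according to Equation 2; by repeating this (the repetition is not shown in the figure), the optimal state for the given measured data is found. As long as the discount rate γ is chosen to be at least 0 and less than 1, the value of Q does not diverge under repetition. Repetition thus yields the optimal parameters.
[0039]
Whether the optimization has actually been achieved and reinforcement learning is complete is judged by the state no longer changing. The actions a include the action "do not change the state", so if leaving the state unchanged is optimal, that is the optimal action at that point. The parameter set of such a state s (in this example, time constant T and gain G) is stored as appropriate in the plant model data file 98 (FIG. 2). By reading the parameters from the plant model data file 98, the agent can at any time reproduce the operation of the plant as of the time those parameters were produced. The optimal state (set of parameter values) is thus found (S56), and the Q-Learning step ends (S58).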
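Putting the pieces together, the learning loop and the stopping rule ("the state no longer changes") might look like the following sketch. Several choices here are assumptions rather than the patent's procedure: actions are explored uniformly at random instead of following a policy π, the reward is taken from the grid cell the agent moves into, moves are clipped at the edges of the parameter range, and the values of C, α, γ and the episode counts are arbitrary.

```python
import numpy as np

ACTIONS = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]  # index 4 is "stay"


def lag_response(u, T, G, dt=1.0):
    """Forward-Euler response of the first-order lag G / (1 + T s) to input u."""
    y, prev = np.empty(len(u)), 0.0
    for k, u_k in enumerate(u):
        prev += dt * (G * u_k - prev) / T
        y[k] = prev
    return y


def move(state, action, shape):
    """Next grid cell after an action, clipped at the edges of the parameter range."""
    i = min(max(state[0] + action[0], 0), shape[0] - 1)
    j = min(max(state[1] + action[1], 0), shape[1] - 1)
    return i, j


def learn_parameters(u, y_meas, T_grid, G_grid, C=10.0, alpha=0.5, gamma=0.5,
                     episodes=500, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    shape = (len(T_grid), len(G_grid))
    # Reward of every cell: Equation 1 averaged over the samples.
    R = np.array([[C - np.mean(np.abs(lag_response(u, T, G) - y_meas))
                   for G in G_grid] for T in T_grid])
    Q = np.zeros(shape + (len(ACTIONS),))
    for _ in range(episodes):
        s = (int(rng.integers(shape[0])), int(rng.integers(shape[1])))
        for a in rng.integers(len(ACTIONS), size=steps):  # random exploration
            s_next = move(s, ACTIONS[a], shape)
            target = R[s_next] + gamma * Q[s_next].max()
            Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target  # Equation 2
            s = s_next
    # Greedy walk: learning is treated as complete once the state stops changing.
    s = (0, 0)
    for _ in range(shape[0] * shape[1]):
        s_next = move(s, ACTIONS[int(Q[s].argmax())], shape)
        if s_next == s:
            break
        s = s_next
    return float(T_grid[s[0]]), float(G_grid[s[1]])
```

As a usage example, one could generate synthetic "measured" data with `lag_response(np.ones(200), 10.0, 2.0)` and call `learn_parameters` with T candidates 5, 10, 15, 20 and G candidates 1.0 to 3.0 in steps of 0.5; the greedy walk should then settle on the cell that generated the data. Real plant data are noisy, so the learned cell would only be the best fit on the grid, not an exact recovery.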
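[0040]
The optimal parameters for the model are found in this way; if the actual plant operation data are updated after this optimization, new samples are taken and the above process is run again.
[0041]
In this embodiment, even when the actual plant operation data are updated, the model parameters can be re-learned at any time using the updated data. This is because reinforcement learning is itself a method that learns from experience and can cope with data that are updated successively. For illustration, this embodiment showed an optimization over only two parameters, but as the example of measured data for a refuse incinerator shows, an actual plant has a great many manipulated variables and state quantities. This advantage of the present invention is particularly valuable when simulating an actual plant that requires optimization with more complex equations having many parameters.
[0042]
Moreover, among the reinforcement learning methods that bring this advantage, adopting Q-Learning means that the actions in the parameter space are themselves evaluated, so that at each iteration of reinforcement learning the process model is adjusted to behave closer to the actual equipment as reflected in the measured data, making it possible to build a simulator closer to reality.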
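[0043]
[Example 1]
This example describes a form in which the simulator of the present invention simulates an actual plant to constitute a plant operation training device. The operator undergoing training sends, from the operator console 22 of FIG. 2 (the virtual manipulated variable input means and the simulated state quantity output means), the same signals to the agent 96 in the process simulator 92 as would be sent to operate the refuse incinerator 1. The behavior of the agent 96 is set by the parameters read from the plant model data file 98, and in response to the signals from the operator console 22 it outputs signals that simulate the behavior of the refuse incinerator 1.
[0044]
The operator console 22 displays the output from the agent 96 of the process simulator 92 as if it were the result of operating the actual refuse incinerator 1. Operators can thus be trained without running the actual refuse incinerator 1.
[0045]
The case in which the state of the actual plant fluctuates is described next. Fluctuations, that is, stochastic behavior, arise in some cases because the true value of a manipulated variable cannot be fully known and in others because the phenomenon itself varies, but most can be characterized by the distribution of the fluctuation and its temporal properties (how it varies over time). For example, for a phenomenon showing a spectral characteristic such as 1/f fluctuation, the data accumulated over a long time can be represented by a probability density function (for example, a normal distribution). Alternatively, for an event whose properties change stepwise at some point in time, a normal distribution can be assumed for the size of the step and a Poisson distribution for the probability of occurrence of the change. Properties modeled mathematically as stochastic events in this way can be applied to the manipulated variables of the plant (for example, the nature of the refuse), to the model parameters of the plant, or to the state quantities of the plant.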
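The stepwise fluctuation described above (normally distributed step size, Poisson-distributed occurrence) could be generated as follows. The function name, the event rate, and the step standard deviation are illustrative assumptions.

```python
import numpy as np


def step_disturbance(n_steps, rate=0.05, sigma=0.2, seed=0):
    """Piecewise-constant disturbance built from random step changes.

    rate  : expected number of step events per time step (Poisson)
    sigma : standard deviation of a single step (normal distribution)
    """
    rng = np.random.default_rng(seed)
    events = rng.poisson(rate, size=n_steps)
    # The sum of k independent normal steps has standard deviation sigma * sqrt(k).
    jumps = np.sqrt(events) * rng.normal(0.0, sigma, size=n_steps)
    return np.cumsum(jumps)


disturbance = step_disturbance(1000)
```

The resulting series can be added to a manipulated variable, a model parameter, or a simulated state quantity; raising `rate` above its realistic value is one way to expose a trainee to more upsets than the real plant would produce.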
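[0046]
Including the fluctuations of the actual plant in the simulation in this way brings it closer to the real plant and allows the operator to be trained appropriately. For training purposes, these probabilities can also be set to values different from the actual ones so as to enhance the effect of the training.
[0047]
[Example 2]
This example describes a form in which an abnormality diagnosis device is configured in combination with the simulator of the present invention. Parameters such as the time constant T and gain G obtained by simulating the actual plant not only define the learned state of the agent 96 but also reflect the state of the actual plant. Because these data are stored in the plant model data file 98, monitoring how the values change yields information about the operating state of the plant. Conditions inside the plant that are unlikely to surface in normal operation can be monitored, albeit indirectly. Plant abnormalities can thus be diagnosed, even while the plant is operating, using information other than the measurable state quantities.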
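Monitoring the stored parameters for drift can be as simple as comparing the latest learned values with a baseline. The data layout (a list of dictionaries, oldest first), the function name, and the 20 % threshold below are hypothetical; the patent does not specify how the comparison is made.

```python
def detect_parameter_drift(history, rel_tolerance=0.2):
    """Return the names of learned parameters that moved away from their baseline.

    history: list of dicts such as {"T": 10.0, "G": 2.0}, oldest first;
             baseline values are assumed to be non-zero.
    """
    baseline, latest = history[0], history[-1]
    flagged = []
    for name, ref in baseline.items():
        change = abs(latest[name] - ref) / abs(ref)
        if change > rel_tolerance:
            flagged.append(name)
    return flagged


history = [
    {"T": 10.0, "G": 2.0},
    {"T": 11.0, "G": 2.0},
    {"T": 14.0, "G": 2.1},
]
suspect = detect_parameter_drift(history)
```

In this made-up history the time constant has grown by 40 % while the gain has changed by 5 %, so only the time constant is flagged for inspection.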
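[0048]
[Example 3]
This example describes a case in which an operation diagnosis device is configured in combination with the simulator of the present invention. An operation diagnosis device is a device for examining future operation. That is, it helps in drawing up an operation plan for subsequent operation of the plant, based on the measured plant data available at a given time and the internal state of the plant obtained from the abnormality diagnosis device of Example 2.
[0049]
Clarifying the relationship between changes in the plant's time constant T and gain G and the manipulated variables and state quantities reveals the relationship between conditions inside the plant and the quantities that can be manipulated or measured from outside. If the conditions on the manipulated variables that give the most suitable operating method for the plant are worked out from this relationship, the quality of an operating method can be judged, that is, operation can be diagnosed, in a way that reflects reality better than managing it by the state quantities alone. This is done using a step of obtaining optimal values of the plant model parameters beforehand by numerical calculation or the like, a step of obtaining the values of the model parameters under actual operating conditions with the device of the present invention, and a step of comparing the optimal values of the model parameters with the values so obtained.
[0050]
[Example 4]
In this example, for a refuse incinerator whose wall thickness decreases over time through wear and the like, the temperature difference between the inside and outside of the furnace is measured as a state quantity. This state quantity is expressed as a function of time by an interpolation formula, and the space of values on that curve is taken as the variable parameter space. By simulating the actual plant with the simulator of the present invention while updating the data as needed, the change in the slope over time is obtained as the result of reinforcement learning. Because the slope of the curve represents the rate at which the furnace wall thickness decreased over the actual plant's past operating history, the life of the furnace if the same operation continues can also be predicted. In other words, the life of the plant's furnace can be analyzed. In addition, within the range in which the simulator works properly, varying the virtual manipulated variables and observing how the slope changes makes it possible to analyze how furnace life depends on operating conditions and to choose an operating method suited to the desired life.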
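The life estimate amounts to extrapolating the learned trend until it reaches a limit. The sketch below uses a straight-line fit as a stand-in for the interpolation formula; the function name, the linear form, and the notion of a fixed limit value are assumptions, since the patent does not give the formula or the end-of-life criterion.

```python
import numpy as np


def remaining_life(times, values, limit):
    """Time left until a linearly decreasing trend reaches `limit`.

    times, values: history of the monitored quantity (for example a
    wall-thickness proxy derived from the inside/outside temperature difference).
    Returns infinity if the fitted trend is not decreasing.
    """
    slope, intercept = np.polyfit(times, values, 1)
    if slope >= 0:
        return float("inf")
    t_limit = (limit - intercept) / slope
    return float(max(t_limit - times[-1], 0.0))


# Made-up history: the quantity drops by 2 units per period, limit is 80.
life = remaining_life([0, 1, 2, 3], [100.0, 98.0, 96.0, 94.0], limit=80.0)
```

Repeating the calculation with slopes obtained under different virtual manipulated variables would show how the estimate depends on the operating conditions.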
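[0051]
Thus, if the measured state quantities and manipulated variables of the present invention are taken as time-series data and an equation is used for the plant's state quantities, the plant's aging and remaining service life can be analyzed.
[0052]
[Effects of the invention]
Using reinforcement learning for process simulation of a plant allows operating data to be reflected successively. This makes it possible to produce a simulator that learns the behavior of the actual plant, including long-term changes in fuel properties and aging. Furthermore, by keeping a process model for each plant and each incinerator and training each by reinforcement learning, the behavior at that point in time, including the quirks of each plant and incinerator, can be simulated faithfully. As a result, operators can be trained on a simulation of the complex behavior of a refuse incinerator. The simulator can also be used to examine optimal operating methods, or operating methods optimized to minimize risk, in advance by simulating the behavior of the plant. Finally, it provides an analysis tool that can reflect measured data at any time and analyze the behavior of the plant.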
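[Brief description of the drawings]
[FIG. 1] A block diagram illustrating the situation in which the simulator of the present invention is used.
[FIG. 2] A block diagram showing the configuration of the simulator when used for operation training in an embodiment of the present invention.
[FIG. 3] A flowchart illustrating the method of building the simulator of this embodiment using reinforcement learning.
[FIG. 4] A flowchart illustrating the learning procedure of the Q-Learning method, an example of reinforcement learning.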
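[Explanation of reference signs]
1 Refuse incinerator
21 Operator
22 Operator console
2 Plant operation device
3 Network
4 Monitoring center
92 Process simulator
94 Actual plant operation data file
96 Agent
98 Plant model data file
962 Reinforcement learning function
964 Process model
[0001]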
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a plant simulator, a method of operating it, and a program therefor. In particular, it relates to a simulation method that simulates (mimics) the operation of a plant whose operation demands operator skill, such as a refuse incineration plant, or of a plant whose state changes over long-term operation.
[0002]
[Prior art]
In a refuse incinerator, the refuse (waste) being burned is made up of constituents in widely varying proportions depending on its nature. This variation in composition is especially pronounced for general waste such as household refuse. As a result, the incinerator itself behaves in a complex way. Moreover, because the behavior is affected by the variation in refuse composition and by the habits of the operators, incinerators often have their own quirks, differing from plant to plant and even from one incinerator to another within the same plant. Operating such a refuse incinerator reliably therefore takes considerable skill. Some plants other than refuse incinerators likewise require skill to operate.
[0003]
That is, in such a plant the operator must run the plant by manipulating many control inputs while reading the relationships among many controlled state quantities, which requires a high degree of skill.
[0004]
A training simulator could be used for operation training on such a plant, but building a practical training simulator requires a plant model as the basis for simulating the plant's complex behavior. For example, Patent Document 1 describes a method of generating a plant simulation model using transfer functions and PID control. Patent Document 2 describes a method of matching a plant simulator to the behavior of the actual plant by using the error between them.
[0005]
Furthermore, in some such plants the process state quantities (temperature, pressure, and so on) are difficult to measure, or the complex processes actually taking place cannot be fully grasped. Beyond the operating skill mentioned above, it is then hard to know what kind of maintenance the plant needs and when, or how much time remains before the end of its service life. Because of these uncertainties, operating periods had to be set, and maintenance carried out, with large safety margins.
[0006]
In the field of learning algorithms, a technique called reinforcement learning is known. Reinforcement learning is a form of unsupervised learning in which an autonomous agent, the learning subject, determines a policy in a given environment using the rewards and penalties obtained from that environment as cues, so as to maximize the value, i.e. the expected return (expected value of the reward) that the policy yields. Its characteristic feature is that the agent can learn even when the environment is complex and uncertain (see, for example, Non-Patent Document 1). As an example of its use, Patent Document 3 discloses a method of optimizing the route of a dredger, but no example of applying it to a process model of a plant or the like has been disclosed.
[0007]
[Patent Document 1]
JP-A-7-64610
[Patent Document 2]
JP-A-10-207507
[Patent Document 3]
JP-A-10-253602
[Non-patent document 1]
Institute of Electrical Engineers of Japan, Investigating Committee on Learning Methods Using GA and Neural Networks and Their Applications (ed.), "Learning and Its Algorithms", Morikita Publishing, August 28, 2002, pp. 155-164
[0008]
[Problems to be solved by the invention]
An actual refuse incineration plant not only behaves in a complex way depending on the nature of the refuse; different plants, even plants built to the same design, and even different incinerators within one plant, behave differently (that is, the quirks of the plant and of the incinerator are compounded by the habits of the operators), and the behavior also changes as the plant ages. When behavior is this complex, a simple model cannot represent it adequately. Nor has the behavior of the actual plant or incinerator, as affected by aging and the like at the time of the simulation, been simulated. To reflect that state of the actual equipment in the process model, and to reduce the plant's uncertainties, the operation of the plant must be analyzed successively.
[0009]
[Means for Solving the Problems]
To solve this problem, the present invention uses a reinforcement learning algorithm in an apparatus or the like that simulates the operation of a plant. In the present invention, a space corresponding to the manipulated variables and state quantities of the plant serves as the environment for reinforcement learning.
[0010]
That is, the present invention provides a method of simulating plant operation comprising the steps of: (a) setting a value function to its initial state; (b) then, using previously prepared actual plant operation data, running a model calculation with a previously created process model for a given manipulated variable to obtain a state quantity; (c) calculating a reward using the manipulated variable, the calculated state quantity, and the actual plant operation data; (d) repeating steps (b) and (c) for each of a plurality of parameters in a parameter space that defines the relationship between the manipulated variable and the state quantity in the actual plant operation data; (e) learning, using the value function, a policy that maximizes the return, i.e. the sum of the rewards, by performing reinforcement learning based on the rewards calculated for the plurality of parameters; and (f) performing a simulation based on the learned parameters obtained with the resulting value function.
[0011]
Here, the value function is a function used as the index of learning, and is a kind of evaluation function used in reinforcement learning. Actual plant operation data are data containing the manipulated variables and state quantities recorded while the actual plant is running. A manipulated variable is any quantity that is adjusted or changed in operating the plant; in a refuse incinerator, as detailed later, the refuse feed rate is one example. The model calculation computes state quantities from manipulated variables, and is based on a physical theoretical model equation, an empirical equation obtained by numerical fitting, a theoretical equation based on a working hypothesis, or the like; its output can be adjusted through some set of parameters. A state quantity is a numerical value monitored while the plant is running, for example the temperature at a certain position in the furnace. When the entity that learns inside the computer needs to be considered concretely, we speak of an "agent". This is an agent in the sense generally used in reinforcement learning, realized inside the computer as the learning subject, but it later also acts as the subject that carries out the simulation. It is not a function of the hardware alone (the arithmetic means and storage means); it is a function realized in a virtual space held at least partly in the computer's main memory, and operates through the cooperation of software and hardware. A reward is a quantity that is added when the value function used as the index of learning is rewritten as the state is updated; it is assigned to the state and expresses the incentive given to the learning agent. The parameter space is the mathematical space of values that the parameters can take. These parameters are the parameters of an equation that approximates the measured data. In general, the invention of the present application has the agent function learn with this parameter space, which corresponds to the measured data, as the "environment" of reinforcement learning. The computer has at least arithmetic means, storage means, input means, and output means. That is, the present invention executes steps (a) to (f) above on a computer provided with arithmetic means, storage means, input means, and output means.
[0012]
With this simulation method, the computer (agent) can be made to behave as observed in the manipulated variables and state quantities of the plant, so that the actual plant can be simulated on a computer.
[0013]
The present invention can also be a method of simulating plant operation that further comprises the steps of: (g) accepting a virtual manipulated variable; (h) running the model calculation for the virtual manipulated variable on the basis of the learned parameters; and (i) outputting the state quantity obtained by the model calculation.
[0014]
A virtual manipulated variable is a value used as input in the simulation, for example a value that a user enters as if it were an input to the actual plant, or a value that another computer sends as a manipulated variable, treating the computer running this method as the plant. A user enters it from an input device such as a keyboard or mouse. The model calculation reflects the learning results at that point in time and, through reinforcement learning, simulates the actual plant. The resulting state quantity is output by this simulation method in place of actually running the plant. The output goes to a display device for the user or to another computer. This allows the simulator of the present invention to be used for training operators in running the plant.
[0015]
In the present invention, the previously prepared actual plant operation data can be time-series data, and the method can further comprise the step of (j) comparing the values of the parameters given by the learned policy at different times and outputting the result of the comparison.
[0016]
Time-series data are data on the manipulated variables and state quantities at a plurality of times. Because the learned policy gives the optimal parameters at that point in time, optimal values of the plant model parameters are obtained for different times. Comparing them, that is, comparing the plant model parameters over time, makes it possible to simulate how the state of the plant changes over time. Moreover, the optimal parameter values at different times reflect the actual phenomena of the plant. The state of the plant can thus be analyzed by way of the model parameters.
[0017]
In the present invention, the value function can be an action-value function, which takes as the value the return that can be expected in the future from an action in a given state and expresses the value of taking action a in state s as a function of s and a; the reinforcement learning can then be performed by the Q-Learning method. If the value function that serves as the evaluation function in reinforcement learning is the action-value function Q(s, a) based on state s and action a, the Q-Learning method can be used.
[0018]
In the present application, the virtual manipulated variable input means is, for example, input means provided on a suitable computer terminal, which accepts input from an operator undergoing training or the like. The simulated state quantity output means is suitable data output means or display means, for example display means that presents the output to the operator undergoing training, who is using the virtual manipulated variable input means, as if it were the operating state of the plant.
[0019]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0020]
[Embodiment 1]
In the present embodiment, a simulator for simulating the operation of a plant will be described.
[0021]
(Outline of the actual plant)
FIG. 1 illustrates the situation in which the simulator of the present invention is used. The refuse incinerator 1 is an example of an actual plant whose operation requires considerable training. Such a plant is run by an operator 21 who performs various operations. Normally, what the operator 21 operates is the operator console 22, which is connected to the plant operation device 2, which in turn is connected to the refuse incinerator 1. The plant operation device 2 presents the various state quantities needed to run the refuse incinerator 1 to the operator through the operator console 22; the operator grasps the state of the refuse incinerator 1 and, according to that state, changes the settings of the manipulated variables to run the incinerator appropriately.
[0022]
The monitoring center 4 is connected to the plant operating device 2 via a network 3 such as a dedicated line, monitors the state of the incinerator 1, and is installed to provide a service for supporting operation management of the plant. Therefore, the monitoring center 4 can collect various operation quantities and state quantities of the refuse incinerator 1. The monitoring center 4 has functions such as risk prediction 5, operation diagnosis 6, abnormal failure diagnosis 7, remaining life prediction 8, and operation training 9, in order to support the operation management of the refuse incinerator 1.
[0023]
(Overview of simulator)
FIG. 2 illustrates the configuration of the simulator according to the case of the driving training device that performs the driving training 9. The operator console 22 is connected to a computer serving as a process simulator 92 via the network 3. The process simulator 92 is disposed, for example, at the monitoring center 4, but may be located at any location as long as it can be connected to the network 3. The process simulator 92 includes an actual plant operation data file 94. The actual plant operation data file 94 may be in any location accessible from the process simulator 92. In the actual plant operation data file 94, measured data of the operation amount and the state amount of the waste incinerator 1 are stored in chronological order through the network 3.
[0024]
(Contents of actual measurement data)
The actual measurement data of the manipulated variables and state variables are various values in an actual waste incinerator. Examples of the operation amount are a feeder speed related to the amount of dust, a damper opening degree of a blowing fan, a primary air temperature, a primary air pressure, a smoke exhaust damper opening degree, and the like, which are artificially operated amounts. . The state quantity is, for example, a furnace temperature, a furnace pressure, an exhaust gas temperature, an exhaust gas amount, an exhaust gas pressure, an exhaust gas component (oxygen amount, carbon monoxide amount, and the like), and is a quantity representing a plant state. In addition, a quantity representing the state of the surrounding environment (for example, temperature and humidity) determined by the weather may be a state quantity or an operation quantity. Since these do not directly represent the state of the plant and do not actively operate, they are not considered here, but they can be added to the simulation. In an actual garbage incinerator, the operation is performed while these numerical data change every moment, and the measured data is stored in a plant actual machine operation data file 94.
[0025]
Here, in an actual refuse incinerator, the relationship between the manipulated variable and the state variable has a functional relationship, but the state variable does not depend only on the manipulated variable at that time. For example, the state quantity at a certain time acts as an initial value for the subsequent state quantity and affects the state quantity at the subsequent time. Further, the manipulated variable at a certain time affects a subsequent state variable with a certain delay, for example, an input variable for a first-order delay element. Further, there is not always a deterministic relationship between the operation amount and the state amount. This is because what is operated as the operation amount does not always completely correspond to the operation amount, but is operated with a certain width (for example, if the feeder speed related to the dust input amount is constant). Even so, the actual amount of garbage input will not always be constant.) This is also because the state quantity is influenced by climatic conditions (temperature and humidity in the air) and the nature of the garbage (the type and composition of the garbage, the amount of water contained, and the like). Further, one of the reasons is that a phenomenon (for example, a combustion process) which must be treated as a stochastic phenomenon in terms of engineering also influences.
[0026]
(Agent operation)
The agent 96 of FIG. 2 has a reinforcement learning function 962 and changes its own state in response to its environment according to the reinforcement learning method. The agent 96 is a virtual entity that exists on a computer, and its state is changed through certain parameters. The agent 96 performs reinforcement learning based on the actual plant operation data. In the present embodiment, the state of the agent 96 is defined by a model that includes mathematical expressions capable of representing the operation of the plant as a numerical input/output relationship. Besides models that include mathematical expressions, the present invention also covers models consisting only of numerical values (for example, a simple matrix giving the correspondence between a vector of manipulated variable data and a vector of state quantity data). In the present embodiment, this mathematical model is referred to as the process model 964. Giving the process model 964 adjustable parameters allows it to be tuned to the actual plant. The process model 964 is changed by changing the numerical values (parameters) in the plant model data file 98 that characterize the plant model. Because this adjustment changes the state of the agent 96 itself, the parameter space in which the process model 964 is adjusted serves as the environment in which the agent 96 learns by reinforcement learning in the present embodiment.
[0027]
(How to build a simulator)
FIG. 3 illustrates a method of constructing the simulator of the present embodiment using the reinforcement learning method. The simulator is constructed using a computer by causing the agent 96 to perform reinforcement learning.
[0028]
First, at the start of learning, a process model is created and initialized (step S1). The process model can be built from model equations that express the plant's characteristics in light of the relevant physical phenomena. FIG. 3 shows the transfer function of a first-order lag element, but any model based on other theoretical considerations or working hypotheses may be used, such as an approximation using an autoregressive model or a model that expresses the effect of combustion turbulence by a probability density function. The model may also be a combination of several physical phenomena. When the model is too simple to reproduce the complexity that actually occurs in the plant, adding a suitable stochastic term makes it possible to reproduce the unavoidable fluctuations seen in the actual plant. This stochastic term need not be considered in the learning stage. The present embodiment uses a model that can express the relationship between the manipulated variables and the state quantities, but even where that relationship cannot be modeled as a whole, it suffices that it can be described as a numerically expressible input/output relationship.
[0029]
Initialization sets the agent 96 to its initial state and sets the action value function Q, used later, to its initial state. The state of the agent 96 is defined by the parameters that govern its operation. For example, if the first-order lag element in FIG. 3 is used as a model in which the amount of refuse processed is the manipulated variable and the furnace temperature is the state quantity, the state of the agent 96 is defined by the pair of the time constant T and the gain G. The range of each parameter to be considered for the chosen model and the step size of its values are also fixed at this stage.
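As an illustration of the first-order lag example above, the following is a minimal sketch of a process model whose state is the pair (T, G). The function name, the explicit Euler discretization, and the sampling interval dt are assumptions made for this sketch and do not appear in the specification.

```python
def simulate_first_order_lag(inputs, gain, time_constant, dt=1.0, x0=0.0):
    """Discretized first-order lag: T * dx/dt + x = G * u.

    inputs        -- sequence of manipulated-variable samples u
    gain          -- gain G of the lag element
    time_constant -- time constant T of the lag element
    dt            -- sampling interval (assumed; keep dt well below T)
    x0            -- initial state quantity
    """
    if time_constant <= 0 or dt <= 0:
        raise ValueError("time_constant and dt must be positive")
    x = x0
    states = []
    for u in inputs:
        # Explicit Euler step of dx/dt = (G * u - x) / T
        x += dt * (gain * u - x) / time_constant
        states.append(x)
    return states


if __name__ == "__main__":
    # Step response: the state quantity approaches G * u over time.
    trace = simulate_first_order_lag([1.0] * 50, gain=2.0, time_constant=5.0)
    print(trace[0], trace[-1])
```

Changing gain and time_constant in such a function corresponds to moving the agent to a different point in the parameter space.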
[0030]
Next, the data from which the agent 96 is to learn are sampled as appropriate from the actual plant operation data file 94 (step S2). Sampling is used because data of adequate accuracy are sufficient as the environment for reinforcement learning.
[0031]
Next, a model calculation is performed using the process model in its current state (step S3). Since a measured state quantity is normally available for each sampled manipulated variable of the plant, the state quantity is calculated with the current process model for the same manipulated variable as in the measurement.
[0032]
Next, a reward is calculated (step S4). For each pair of manipulated variable and state quantity obtained from the actual plant, the corresponding pair of the same manipulated variable and the state quantity calculated above is considered. In general, even for the same manipulated variable, the actual state quantity of the plant differs from the state quantity output by the agent 96. This difference can arise from the imperfection of the model (the model is too simple or its parameters are not optimized), from the limited accuracy of the manipulated variables in the actual plant, from stochastic behavior and other fluctuation factors in the plant, or from errors in measuring the state quantities. The reward can be determined, for example, from the difference (residual) between the measured and calculated state quantities. For example, taking the absolute value of the residual for each item of sampled measured data,
Equation 1
r = C − | calculated state quantity − actual state quantity |
(C: positive constant)
Thus, a reward element r can be determined for each sampled data item. The reward data needed for the subsequent learning must be obtained for every set of parameter values within a given range of the parameter space (in this example, the range of values that the time constant T and the gain G can take). For each set of parameter values, for example, the reward elements r above are summed over the data belonging to that set, and the sum is divided by the number of data items so that a parameter set with many data items does not receive an apparently large reward. Besides this example, the reward can be defined in other suitable ways, using any appropriate numerical value that represents, for the agent 96, the difference between the calculated and actual values.
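The normalized reward described above can be sketched as follows. This is only an illustration: the function name, the callable model argument, and the default value of C are assumptions, and the specification allows other reward definitions.

```python
def reward_for_parameters(measured_inputs, measured_states, model, c=10.0):
    """Average of r = C - |calculated - actual| over the sampled data.

    model -- callable mapping the list of manipulated-variable samples to a
             list of calculated state quantities for one parameter set
             (hypothetical interface chosen for this sketch)
    c     -- the positive constant C of Equation 1 (value assumed)
    """
    calculated = model(measured_inputs)
    if not measured_states or len(calculated) != len(measured_states):
        raise ValueError("calculated and measured data must align and be non-empty")
    total = sum(c - abs(xc - xm) for xc, xm in zip(calculated, measured_states))
    # Divide by the number of data items so that parameter sets with more
    # data do not receive an apparently larger reward.
    return total / len(measured_states)


if __name__ == "__main__":
    def toy_model(us):
        # Stand-in for a process model evaluated at one parameter set.
        return [2.0 * u for u in us]

    print(reward_for_parameters([1.0, 2.0, 3.0], [2.0, 4.0, 7.0], toy_model))
```

Evaluating this function once per grid point of (T, G) gives the reward data used in the learning step.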
[0033]
Then, based on the reward calculated for each parameter set in the parameter space, Q-learning, a type of reinforcement learning, is performed (step S5). Q-learning is adopted here because it makes the action taken by the agent, in addition to the agent's state, a subject of learning. Because the action is also learned, it becomes possible, for example, to build a simulator that tracks the measured data well and with little time delay even for a plant whose behavior changes quickly relative to the optimization calculation. For this reason Q-learning is used in the present embodiment, but the present invention may use other reinforcement learning methods such as TD learning. With TD learning, for example, the action is not evaluated, because the next state is determined from the agent's state at that time, and an error due to the time difference arises when the actual plant data change quickly. However, when the number of actions is large (for example, when there are many parameters, as when the model equation is a higher-order polynomial), the amount of calculation can be reduced and the number of learning iterations increased, so the error may turn out small. The actor-critic method, another example of reinforcement learning, likewise has the advantages that the amount of calculation can be reduced, as with TD learning, and that stochastic action selection is possible.
[0034]
In Q-learning, an action value function Q(s, a) is considered for a state s and an action a available in that state. The action value function Q(s, a) is the evaluation value (value) that Q-learning assigns to taking action a in state s. In this example, the state s is the point currently occupied by the agent 96 in the two-dimensional space of the time constant T and the gain G. The initial value of Q(s, a) is, for example, 0, but in general it can be any value (step S0).
[0035]
(About Q-Learning)
When Q-learning starts (step S50), an action a is first selected stochastically in the state s according to the current policy π (step S52). This determines the next state s'. Here the policy π is a function of the state s and the action a. When the policy π permits several actions, one of them is selected stochastically using a suitable random number. The action value function Q(s, a) for taking action a from state s is then recalculated using the maximum Q value over the actions a' available in the state s'.
[0036]
Next, Q is updated according to Equation 2 (step S54). The update uses the reward r in the state s, the discount rate γ (a value of at least 0 and less than 1, fixed in advance in step S0), and the learning rate α (a value greater than 0 and at most 1, fixed in advance in step S0).
Equation 2
Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_a′ Q(s′, a′) ]
[0037]
An example of an action a is a move from the point of the current state to a new state in one of eight directions (up, down, left, right, and the four diagonals) in the plane of the time constant T and the gain G. The maximum (max) in Equation 2 is taken over the actions a' that can be taken in the state s'.
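The update of Equation 2 with the eight-direction action set can be sketched as follows. The dictionary-based Q table, the grid-index representation of a state, and the default values of α and γ are assumptions made for illustration only.

```python
from collections import defaultdict

# Eight moves on the (T, G) grid: every neighbor except staying in place.
ACTIONS = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]


def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')].

    q          -- mapping from (state, action) to value; missing entries are 0
    state      -- current grid index (i, j) in the (T, G) plane
    next_state -- grid index reached by taking `action` from `state`
    """
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * (reward + gamma * best_next)
    return q[(state, action)]


if __name__ == "__main__":
    q = defaultdict(float)      # initial value 0 for every Q(s, a)
    q[((1, 1), (1, 0))] = 2.0   # pretend one value was already learned
    print(q_update(q, (0, 0), (1, 1), reward=1.0, next_state=(1, 1)))
```

Handling of moves that would leave the permitted parameter range is omitted here and would have to be decided separately.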
[0038]
When the state changes to s' as a result of the transition under action a (step S56), the action value function Q for action a in state s is reinforced according to Equation 2. Repeating this process (the repetition is not shown in the figure) yields the optimum state for the given measured data. Because the discount rate γ is chosen to be at least 0 and less than 1, the value of Q does not diverge under repetition. The optimum parameters are thus obtained through repetition.
[0039]
Whether the optimization has actually been achieved and reinforcement learning is complete is judged from the fact that the state no longer transitions. When the selected action a leaves the state unchanged, that action is the optimum action at that point. The set of parameters for such a state s (the time constant T and the gain G in this example) is stored as appropriate in the plant model data file 98 (FIG. 2). By reading the parameters from the plant model data file 98, the agent can at any time reproduce the operation of the plant as it was when the parameters were created. In this way an optimum state (set of parameter values) is determined (S56), and the Q-learning step ends (S58).
[0040]
The optimum parameters used for the model are obtained as described above. When the actual plant operation data is updated after the optimization, new sampling is performed and the above process is executed again.
[0041]
In the present embodiment, even when the actual plant operation data are updated, the model parameters can be relearned at any time using the updated data. This is because reinforcement learning is itself a method that learns from experience and can therefore handle data that are updated sequentially. For purposes of illustration, the present embodiment shows optimization over only two parameters. As the example of measured data for the waste incinerator above shows, however, an actual plant has a large number of manipulated variables and state quantities. The above advantages of the present invention are especially valuable when simulating an actual plant that must be fitted with more complicated model equations having many parameters.
[0042]
In addition, when Q-learning is adopted among the reinforcement learning methods offering these advantages, the action in the parameter space itself becomes a subject of evaluation. As a result, the adjustment yields behavior close to that of the actual plant, and a simulator closer to the real one can be constructed.
[0043]
[Example 1]
This example describes a plant operation training apparatus configured by simulating an actual plant with the simulator of the present invention. The operator being trained sends signals from the operator console 22 (virtual manipulated variable input means, simulated state quantity output means) shown in the drawings. The operation of the agent 96 is set according to the parameters read from the plant model data file 98, and the agent 96 outputs a signal simulating the behavior of the incinerator 1 in response to the signal from the operator console 22.
[0044]
The output from the agent 96 of the process simulator 92 is displayed on the operator console 22 as if it were an actual operation result of the incinerator 1. This makes it possible to train operators without actually operating the incinerator 1.
[0045]
Here, the case in which the actual state of the plant fluctuates is described. Fluctuations include cases in which the true value of a manipulated variable cannot actually be known and cases in which the phenomenon itself behaves stochastically. Most fluctuations are characterized by their distribution and their temporal characteristics. For example, for a phenomenon with spectral characteristics such as 1/f fluctuation, data accumulated over a long period can be expressed by a probability density function (for example, a normal distribution). For events whose properties change stepwise at certain times, a normal distribution can be assumed for the step width and a Poisson distribution for the occurrence of the events. In this way, a suitably modeled stochastic property can be given to a manipulated variable of the plant (for example, the refuse properties), a plant model parameter, or a plant state quantity.
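The stepwise fluctuation described above (Poisson-distributed occurrences with normally distributed step widths) could be generated along the following lines. The function name, the use of exponential waiting times to realize the Poisson occurrences, and all parameter names are assumptions of this sketch.

```python
import random


def stepwise_fluctuation(n_steps, rate, step_sigma, dt=1.0, seed=None):
    """Return a series whose level jumps at Poisson-distributed event times.

    rate       -- mean number of step events per unit time (must be > 0)
    step_sigma -- standard deviation of the normally distributed step width
    dt         -- sampling interval of the returned series (assumed)
    seed       -- optional seed for reproducible training scenarios
    """
    if rate <= 0 or dt <= 0:
        raise ValueError("rate and dt must be positive")
    rng = random.Random(seed)
    level = 0.0
    series = []
    # Exponential waiting times between events give Poisson event counts.
    next_event = rng.expovariate(rate)
    for k in range(n_steps):
        t = (k + 1) * dt
        while next_event <= t:
            level += rng.gauss(0.0, step_sigma)
            next_event += rng.expovariate(rate)
        series.append(level)
    return series


if __name__ == "__main__":
    # Example: an offset that could be added to a refuse-property input.
    print(stepwise_fluctuation(10, rate=0.3, step_sigma=1.0, seed=0))
```

Raising rate or step_sigma above realistic values is one way to make a training scenario harder than actual operation.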
[0046]
By appropriately simulating the fluctuations of the actual plant in this way, operators can be given training that is closer to the actual plant. For training purposes, the probabilities can also be set to values different from the actual ones in order to enhance the training effect.
[0047]
[Example 2]
This example describes an abnormality diagnosis device configured in combination with the simulator of the present invention. Parameters such as the time constant T and the gain G obtained by simulating the actual plant not only define the state of the agent 96 after learning but also reflect the actual state of the plant. Since these data are stored in the plant model data file 98, information on the operating state of the plant can be obtained by monitoring changes in their values. Conditions inside the plant that rarely become apparent during normal operation can thus be monitored, albeit indirectly. Consequently, plant abnormalities can be diagnosed during operation using quantities other than the directly measurable state quantities.
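One simple way to monitor changes in the learned parameters is to compare the latest values with an earlier baseline. The sketch below is an assumption for illustration: the relative-change criterion, the tolerance value, and the data layout are not specified in this document.

```python
def detect_parameter_drift(history, tolerance=0.2):
    """Flag learned parameters whose relative change exceeds a tolerance.

    history   -- list of dicts such as {"T": 30.0, "G": 1.5}, oldest first,
                 e.g. successive contents of the plant model data file
    tolerance -- allowed relative change from the first entry (assumed value)
    Returns a dict mapping each flagged parameter name to its relative change.
    """
    if len(history) < 2:
        return {}
    baseline, latest = history[0], history[-1]
    flagged = {}
    for name, base in baseline.items():
        if base == 0:
            continue  # relative change is undefined for a zero baseline
        change = abs(latest[name] - base) / abs(base)
        if change > tolerance:
            flagged[name] = change
    return flagged


if __name__ == "__main__":
    records = [{"T": 30.0, "G": 1.5}, {"T": 31.0, "G": 1.5}, {"T": 40.0, "G": 1.6}]
    print(detect_parameter_drift(records))
```

A flagged parameter is only a prompt for closer inspection; what counts as abnormal depends on the plant.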
[0048]
[Example 3]
This example describes an operation diagnosis device configured in combination with the simulator of the present invention. The operation diagnosis device is a device for examining future operation. That is, it helps establish an operation plan for subsequent plant operation, based on the measured plant data obtained at a given time and on the internal state of the plant obtained from those data by the abnormality diagnosis device of Example 2.
[0049]
Clarifying how changes in the time constant T or the gain G of the plant relate to the manipulated variables or state quantities clarifies the relationship between the state inside the plant and the quantities that can be operated or measured from outside. If the conditions on the manipulated variables that give the most suitable operation of the plant are determined from this relationship, the operation method can be judged in a way that reflects actual conditions better than when its quality is managed by state quantities alone; that is, operation diagnosis can be performed. This involves a step of obtaining in advance, by numerical calculation or the like, the optimum values of the plant model parameters, a step of obtaining the values of the model parameters under actual operating conditions with the apparatus of the present invention, and a step of comparing the optimum values with the values obtained under actual operating conditions.
[0050]
[Example 4]
In this example, for the case where the furnace wall of the waste incinerator becomes thinner through wear over time, the temperature difference between the inside and outside of the furnace is measured as a state quantity. The time course of the state quantity is expressed by an interpolation formula, and the space of values on the resulting curve is taken as the variable parameter space. By simulating the actual plant with the simulator of the present invention while updating the data as needed, the change of the slope over time is obtained as a result of reinforcement learning. Since the slope of the curve represents the rate at which the furnace wall thickness has decreased over the past operating history of the actual plant, the life of the furnace under continued operation of the same kind can also be predicted. That is, the life of the plant's furnace can be analyzed. Furthermore, within the range in which the simulator operates properly, the virtual manipulated variables can be varied and the resulting change in the slope observed to analyze how the furnace life depends on the operating conditions, making it possible to select the operation method to be used.
[0051]
As described above, if the measured data for the state quantities and manipulated variables are treated as time-dependent and a mathematical expression is used for the state quantity of the plant, the aging of the plant and its remaining service life can be analyzed.
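The life prediction in Example 4 can be illustrated by extrapolating a fitted slope. The sketch below assumes that wall-thickness estimates have already been derived from the measured data and that the decrease is approximately linear; the function name, the least-squares fit, and the minimum-thickness threshold are all hypothetical and are not taken from the specification.

```python
def remaining_life(times, thickness, min_thickness):
    """Estimate time left until thickness falls to min_thickness.

    times         -- measurement times, in increasing order
    thickness     -- wall-thickness estimates at those times (assumed given)
    min_thickness -- thickness at which the furnace is considered worn out
    Returns the remaining time after the last measurement, or None if the
    fitted slope is not negative (no wear trend to extrapolate).
    """
    n = len(times)
    if n < 2 or n != len(thickness):
        raise ValueError("need at least two aligned samples")
    mean_t = sum(times) / n
    mean_y = sum(thickness) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("times must not all be equal")
    # Ordinary least-squares line: thickness ~ intercept + slope * time
    slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, thickness)) / sxx
    intercept = mean_y - slope * mean_t
    if slope >= 0:
        return None
    t_end = (min_thickness - intercept) / slope
    return max(0.0, t_end - times[-1])


if __name__ == "__main__":
    print(remaining_life([0, 1, 2, 3], [100.0, 98.0, 96.0, 94.0], min_thickness=80.0))
```

Repeating the estimate for slopes obtained under different virtual operating conditions gives a rough comparison of how those conditions affect furnace life.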
[0052]
[Effect of the Invention]
Using the reinforcement learning method for process simulation of the plant allows operation data to be reflected sequentially. This makes it possible to produce a simulator that learns the actual behavior of the plant, taking into account long-term changes in fuel properties and aging. In addition, by providing a process model for each plant and each incinerator and training each with the reinforcement learning method, the behavior at that time can be simulated faithfully, including the idiosyncrasies of each plant and each incinerator. As a result, operation training that simulates the complicated behavior of a waste incinerator becomes possible. Furthermore, with this simulator, the behavior of the plant can be simulated in advance in order to examine an optimal operation method or an operation method that minimizes risk. An analysis tool that can reflect measured data at any time and analyze the behavior of the plant is also obtained.
[Brief description of the drawings]
FIG. 1 is a configuration diagram illustrating a situation in which a simulator of the present invention is used.
FIG. 2 is a configuration diagram illustrating the configuration of the simulator when performing operation training according to the embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method for constructing a simulator according to the present embodiment using a reinforcement learning method.
FIG. 4 is a flowchart illustrating a learning method of a Q-Learning method, which is an example of a reinforcement learning method.
[Explanation of symbols]
1 Garbage incinerator
21 Operator
22 Operator console
2 Plant operation equipment
3 Network
4 Monitoring center
92 Process Simulator
94 Plant operation data file
96 Agent
98 Plant model data file
962 Reinforcement learning function
964 Process Model

Claims

(A) initializing the value function;
(B) using a previously prepared plant actual machine operation data to execute a model calculation with a process model created in advance for a certain operation amount to obtain a state amount;
(C) calculating a reward using the manipulated variable, the calculated state variable, and the actual plant operation data;
(D) repeating step (B) and step (C) for each of a plurality of parameters in a parameter space that defines a relationship between an operation amount and a state amount in actual plant operation data;
(E) learning, by using the value function, a policy for maximizing the profit, which is the sum of the rewards, by performing reinforcement learning based on the rewards calculated for the plurality of parameters;
(F) performing a simulation based on learned parameters obtained using the obtained value function.

(G) receiving a virtual operation amount;
(H) performing the model calculation according to the virtual operation amount based on the learned parameters;
The method according to claim 1, further comprising: (I) outputting a state quantity obtained by the model calculation.

The previously prepared plant actual machine operation data is time-lapse data,
The method of simulating plant operation according to claim 1, further comprising: (J) comparing values of parameters based on the policy after learning at different times and outputting a result of the comparison.

The simulation method according to claim 1, wherein the value function is an action value function of a state s and an action a that defines the value of adopting the action a in the state s as the profit expected in the future from that action in that state, and the reinforcement learning is performed by a Q-learning method.

A program for causing a computer to execute each step according to claim 1.

A simulator comprising a computer having arithmetic means, storage means, input means, and output means,
Storing plant operation data prepared in advance in the storage means,
The arithmetic means, using the actual machine operation data, for each of a plurality of parameters in the parameter space that defines the relationship between the manipulated variable and the state quantity in the actual plant operation data, executes a model calculation with a process model created in advance for a given manipulated variable to obtain a state quantity, repeatedly calculates a reward using the manipulated variable, the calculated state quantity, and the actual plant operation data, learns, using the value function, a policy for maximizing the profit that is the sum of the rewards by performing reinforcement learning based on the rewards calculated for the plurality of parameters, and stores the learned policy in the storage means,
A plant operation simulator for executing a simulation by executing the model calculation in accordance with an input amount received by the input means, based on a parameter determined from the learned policy.

Virtual manipulated variable input means;
Simulated state quantity output means,
The computer calls the learned policy, executes the model calculation according to the virtual operation amount from the virtual operation amount input unit, based on a parameter determined from the learned policy,
7. The plant operation simulator according to claim 6, wherein training of an operator of the plant is performed by outputting a simulated state quantity corresponding to the virtual manipulated variable to the simulated state quantity output means.

The plant operation data prepared in advance is time-lapse data, and the values of parameters based on the policy after learning at different times are compared, and the result of the comparison is output to analyze the state of the plant. A plant operation simulator according to claim 6.