JP6025785B2

JP6025785B2 - Automatic speech recognition proxy system for natural language understanding

Info

Publication number: JP6025785B2
Application number: JP2014140729A
Authority: JP
Inventors: イエラカリスヨーヨス; ビー．カルスアルウィン; ラプシーナラリッサ
Original assignee: インタラクションズリミテッドライアビリティカンパニー
Priority date: 2013-07-08
Filing date: 2014-07-08
Publication date: 2016-11-16
Anticipated expiration: 2034-07-08
Also published as: JP2015018238A

Description

The present invention relates to the field of interactive response communication systems, and more particularly to an interactive response communication system that selectively routes utterances to an automatic speech recognition (ASR) processor, to human speech recognition (HSR) resources, or to both ASR and HSR mechanisms.

Related Application
This application is a continuation-in-part of co-pending, commonly owned U.S. patent application Ser. No. 13/070,865 (now U.S. Pat. No. 8,484,031), entitled "Automated Speech Recognition Proxy System For Natural Language Understanding" and filed on March 24, 2011, which is itself a continuation-in-part of U.S. patent application Ser. No. 12/985,174, entitled "Automated Speech Recognition System for Natural Language Understanding" and filed on January 5, 2011, and claims priority to that application under 35 U.S.C. § 120. The contents of the applications referenced above are incorporated herein by reference as if fully set forth herein.

Many companies interact with their customers by electronic means, most commonly telephone, e-mail, and online text chat. Such electronic systems save companies a great deal of money by limiting the number of customer service or support agents needed. However, the customer experience these systems provide is generally less than satisfactory. The experience may be acceptable for simple transactions, but it is often inconsistent or downright frustrating when the customer is not adept at talking to or interacting with a computer.

Such interactive response systems are well known in the art. For example, providing customer service over the telephone using an interactive voice response (IVR) system is one such system. An example of a customer service system using IVR technology is described in Patent Document 1. An IVR system typically communicates with customers using a set of prerecorded phrases, responds to a limited set of spoken inputs and touch-tone signals, and can route or transfer calls. A drawback of such IVR systems is that they are normally built around a "menu" structure, which presents callers with only a few valid options at a time and requires a narrow range of responses from callers.

Many of these IVR systems now incorporate speech recognition technology. An example of a system incorporating speech recognition technology is described in Patent Document 2. The robustness of the speech recognition technology used by IVR systems varies, but the responses they listen for and can understand are often limited to a predetermined range, which restricts the end user's ability to interact with the system in everyday language. Callers therefore often feel that they are being forced to speak to the system "as though they were talking to a computer." Moreover, even when interacting with a system that uses speech recognition, customer input is often either not recognized or incorrectly determined, causing the customer to seek contact with a human customer service agent as soon as possible.

Human customer service agents continue to be used for more involved customer service requests. These agents may speak to customers by telephone, respond to customer e-mails, and chat with customers online. Agents typically answer customer questions or respond to customer requests. Companies have customer service groups, which are sometimes outsourced to businesses that specialize in "customer relationship management." Such businesses run centers staffed by hundreds of agents who spend their entire working day on the telephone or otherwise interacting with customers. An example of such a system is described in Patent Document 3.

The typical model of customer service interaction is for one agent to assist a customer for the duration of the customer's interaction. At times, one agent (e.g., a technical support representative) may transfer the customer to another agent (such as a sales representative) if the customer needs help with multiple requests. In general, though, one agent spends his or her time assisting that one customer for the full duration of the customer's call or chat session, or is occupied with resolving the customer's issue via e-mail. Most call centers also expect the agent to take the time to log (document) the call. Deficiencies in this heavy agent-interface model are (1) a high agent turnover rate and (2) the large amount of initial and ongoing agent training usually required, all of which make customer service a significant expense for these customer service providers.

To alleviate some of the expenses associated with agents, some organizations outsource their customer service needs. One trend in the United States in recent years, as high-speed fiber-optic voice and data networks have proliferated, is to locate customer service centers overseas to take advantage of lower labor costs. Such outsourcing requires that the overseas customer service agents be fluent in English. Where these agents are used for telephone-based support, the agent's ability to understand and speak clearly in English is often an issue. An unfortunate result of offshore outsourcing is misunderstanding and a less than satisfactory customer service experience for the person seeking service.

Improved interactive response systems blend computer-implemented speech recognition with intermittent use of human agents. To some extent, this has been done for years. A system that uses both a human attendant and an automated speech recognizer has been described (Patent Document 4). Similarly, a system has been disclosed in which a human agent is presented with only those portions of a call in which the user's utterance requires human interpretation (Patent Document 5). The contents of these patents, as well as all other art referred to herein, are incorporated herein by reference as if fully set forth herein. The benefit of such systems increases when their cost is relatively low, and low cost generally requires limited human interaction. To achieve such limited human interaction, it would be desirable to have a system that requires minimal initial training and whose results continue to improve over time. In particular, a learning/training system that provides "day-one" performance suitable for production use and that quickly improves in efficiency over time would be particularly valuable.

Many existing ASR systems suffer from significant training constraints, such as the need to be trained to recognize the voice of each particular user of the system, or the need to severely limit the recognized vocabulary in order to provide reasonable results. Such systems are readily perceived by users as artificial. Consider the difference between the typical human prompt, "How can I help you?" and the artificial prompt, "Say 'make' if you want to make a reservation, 'status' if you would like to check the status of a reservation, or 'cancel' to cancel a reservation."

The goal of ASR-based speech systems has been to achieve a conversational system for conducting caller dialogs, much like HAL in "2001: A Space Odyssey." Voice user interface (VUI) techniques were developed to improve ASR capability. Under these techniques, prompts are worded precisely and compactly in an attempt to reduce the vocabulary in use and to give callers hints about the words they should say, so that more accurate speech recognition can be achieved. ASR has improved since then and now also handles open-ended conversational recognition. However, such open-ended conversation requires a larger vocabulary, which results in a much higher speech recognition error rate. The result is to leave callers with more frustration with, and disdain for, IVR systems, for example because such systems over-confirm what was previously said and understood, make wrong choices, and force callers back to a previous menu. VUI design attempts to lead callers into so-called "directed dialogs" in an effort to narrow the conversation from the general to the specific. Because small domains have limited vocabularies and a comparatively much smaller repertoire of utterances, ASR and NLU have been more successful when applied to directed dialogs. The IVR industry has worked on characterizing knowledge domains using statistics and "search" based on speech recognition to further improve understanding. However, these approaches still do not handle well a significant number of callers, particularly those with dialects or pronunciation patterns that are hard to understand even with complex techniques such as building personalized ASR acoustic models. With the advent of human-assisted recognition, there is now an opportunity to apply human understanding alongside automation to recognize speech, text, graphics, and video, which makes understanding more accurate and avoids many of the weaknesses of ASR-based IVR systems. The fundamental task of an IVR system is to coordinate the filling of information slots in the various business forms that correspond to user requests. In conventional IVR systems, this coordination is typically carried out according to a decision tree fixed in advance, with little deviation from a limited number of ways of interacting with the user. A variety of recognition strategies have been developed, including variations in VUI design, different criteria to optimize for in order to identify accurate understanding, and techniques for understanding and recognizing in the shortest possible time.

There are many reasons for a system to use a variety of appropriate techniques to make the interaction between a caller and an automated system that uses human-assisted recognition as seamless and natural as possible.

Humans recognize and interpret meaning with far greater accuracy than automatic speech recognition (ASR), graphics and video processing, and natural language understanding (NLU) techniques. If humans can be used for understanding when the accuracy of automation is insufficient, it becomes possible to automate considerably more user interactions while still providing a good user experience. However, whereas computer resources can be scaled to meet unusual, unpredicted peaks in volume, human resources must be scheduled and may not be available in time to meet a peak. There is therefore a need for a system that automatically adapts to the amount of HSR required in any particular application, also using DTMF (dual-tone multi-frequency) input when accuracy is insufficient, and thereby minimizes the use of HSR. Even if the level of human interaction changes during unscheduled peaks, self-service can continue to be carried out in a more conventional manner.

The goal now becomes how best to combine human assistance and automation to recognize and interpret caller utterances while achieving the most human-like user experience possible; the conventional techniques used to tune speech recognition and classify recognized utterances to achieve the highest level of recognition change in subtle but important ways. Thus, a challenge not addressed by existing systems is how to use the most efficient combination of humans and automation in a given situation under a given workload while providing the most successful user experience.

Conventionally, an ASR system begins "listening" to an utterance as it is spoken. If automated recognition fails, the user waits for the length of time it takes to speak the complete utterance before HSR can begin listening to and processing it. It would be desirable instead for the system to attempt to understand the interaction in near real time. For example, as users speak more and more words to describe what they mean (their "intent"), processing the utterance first by ASR and then by HSR creates a significant time gap between the end of the utterance and the beginning of the response. This time gap can be filled with audio playback, such as typing sounds. That may work for some applications, particularly those that collect data. In other applications, the time gap makes it difficult to carry on a natural conversation with the system. In addition, longer speech often results in lower recognition quality: the longer the speech, the more words it contains and the more word combinations it contains. Together, these increase speech recognition errors and reduce the accuracy of understanding.

Accordingly, there is a need for an automated recognition system that can understand, and predict successful recognition, as early as possible before human assistance is used, and that can maintain a human-like interaction. Furthermore, because human assistance is sometimes called for, the automated recognition system also needs to be able to monitor the staffing of human assistance and, depending on system status and load and on the skill-set capabilities of the available human assistance, automatically adjust the confidence required for understanding and/or proceed to full automation.

Larger-scale systems, such as natural language understanding (NLU) systems, require a lengthy machine learning period and laborious handcrafting of grammars to obtain usable results from larger grammars and vocabularies. Particularly in environments in which the vocabulary may be dynamic (such as a system taking ticket orders for a new play or for a concert by a new musical group), the learning period may be far too long to provide satisfactory results. Accents, dialects, and regional differences in vocabulary and grammar further complicate the task of teaching such systems to achieve a reasonable threshold of recognition accuracy.

Currently available ASR systems are effective at recognizing simple spoken utterances such as numbers, data, and simple grammars (i.e., a small set of words and expressions made up of them). To date, however, ASR systems have not delivered a level of speech recognition performance high enough to create a voice interface that provides free-flowing conversation. In addition, ASR performance degrades not only with accents and dialects, as noted above, but also with background noise, adult rather than child voices, and in many cases female rather than male voices. ASR performance has improved over time, and some systems use statistical language models intended to recognize a very wide range of responses from callers, so that callers can be recognized even when they speak naturally rather than in a highly constrained manner. Even so, ASR performance still does not match actual human-to-human interaction, and the ASR systems that provide the highest levels of performance are time-consuming and expensive to build and to tune for a particular application.

Tuning grammars by considering the statistical probabilities of the various expected answers, as well as synonyms, is one technique used to improve ASR performance. Another technique is to create a statistical language model, which may require considerable effort to transcribe recordings of utterances from live telephone conversations with live operators. ASR performance is quite acceptable in some applications but not yet suitable in others, so known ASR-based systems still lack the ability to understand unconstrained natural utterances.

Accordingly, there remains a need in the art for an interactive system that provides a consistently high-quality experience without the limitations of its constituent ASR components.

Patent Document 1: U.S. Pat. No. 6,411,686
Patent Document 2: U.S. Pat. No. 6,499,013
Patent Document 3: U.S. Pat. No. 5,987,116
Patent Document 4: U.S. Pat. No. 5,033,088
Patent Document 5: U.S. Pat. No. 7,606,718

An interactive response system blends HSR subsystems with ASR subsystems to facilitate natural language understanding and improve the overall capability of voice user interfaces. The system allows an imperfect ASR subsystem to use HSR when necessary while still relieving the burden on a loaded HSR subsystem. An IVR system is implemented using an ASR proxy that decides, based on a set of rules, whether to route an utterance to a single ASR only, to route an utterance to HSR in addition to at least one ASR, to route an utterance only to one or more HSR subsystems, to re-route to HSR an utterance originally sent to an ASR, to use HSR to assist in tuning and training one or more ASRs, and to use multiple ASRs to increase the reliability of results.
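By way of illustration only, the rule-based routing decision summarized above might be sketched as follows. The class names, the `predicted_asr_confidence` field, and every threshold value are hypothetical and are not taken from this disclosure; the sketch covers only three of the listed routes plus the re-routing decision.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ASR_ONLY = "asr_only"
    ASR_AND_HSR = "asr_and_hsr"
    HSR_ONLY = "hsr_only"


@dataclass
class Utterance:
    prompt_id: str
    # Hypothetical estimate (0.0 to 1.0) of how well ASR handles this prompt.
    predicted_asr_confidence: float


def choose_route(utt: Utterance, hsr_available: bool,
                 high: float = 0.85, low: float = 0.40) -> Route:
    """Pick a route from illustrative rules; the thresholds are assumptions."""
    if not hsr_available or utt.predicted_asr_confidence >= high:
        return Route.ASR_ONLY
    if utt.predicted_asr_confidence >= low:
        return Route.ASR_AND_HSR
    return Route.HSR_ONLY


def should_reroute_to_hsr(asr_confidence: float, hsr_available: bool,
                          accept: float = 0.80) -> bool:
    """True when an utterance first sent to ASR should be re-routed to HSR."""
    return hsr_available and asr_confidence < accept
```

In this sketch the caller of `choose_route` never needs to know which resource was chosen, which is one plausible way to keep the selection hidden behind a single proxy interface.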

In one aspect, the ASR proxy comprises a recognition decision engine and a results decision engine. In a related aspect, these two engines facilitate recognition performance, natural language understanding, and recognition and grammar tuning in order to accurately fill the information slots in various business forms.

In yet another aspect, the ASR proxy selects ASR and/or HSR resources based on one or more of application criteria, recognition confidence prediction, historical results, and recognition experienced with a particular user's voice.

In yet another aspect, the ASR proxy is configurable based on various parameters, such as maximizing the use of ASR or making the communication more or less "human-like."

In yet another aspect, the ASR proxy automatically adapts to the system resource capacity of HSR to maximize the use of ASR or DTMF.
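One way such capacity adaptation could work, offered purely as a sketch under stated assumptions, is to lower the confidence an ASR result must reach as the HSR pool fills up, and to fall back to DTMF input when no analyst is free. The linear interpolation, the function names, and the numeric defaults below are all hypothetical.

```python
def acceptance_threshold(base: float, hsr_busy: int, hsr_staffed: int,
                         floor: float = 0.50) -> float:
    """Lower the ASR acceptance threshold as HSR utilization rises.

    The linear rule and the floor value are illustrative assumptions.
    """
    if hsr_staffed <= 0:
        return floor  # no analysts on duty: accept as much ASR as allowed
    utilization = min(max(hsr_busy / hsr_staffed, 0.0), 1.0)
    return base - (base - floor) * utilization


def next_step(asr_confidence: float, threshold: float, hsr_free: bool) -> str:
    """Decide what happens to one utterance after ASR has scored it."""
    if asr_confidence >= threshold:
        return "accept_asr"
    if hsr_free:
        return "send_to_hsr"
    return "reprompt_dtmf"  # conventional self-service fallback
```

With an idle HSR pool the threshold stays at `base`, so more utterances go to analysts; with a saturated pool it drops to `floor`, so more ASR results are accepted and the remainder fall back to DTMF.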

In yet another aspect, the ASR proxy uses the results of an evaluation component that analyzes ASR results to select one or more of an optimal length for length-based tests, optimal quality metric levels for user responses to different prompts, and optimal classifiers for different prompts.

In yet another aspect, the selection of ASR or HSR resources by the ASR proxy is transparent to the software application that calls upon the ASR proxy for speech recognition.

In yet another aspect, the system maintains a more human-like experience by using a method of predicting successful automated recognition in near real time when HSR is in use.

Those skilled in the art will recognize that the particular configurations addressed in this disclosure can also be implemented in various other ways. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

The foregoing features may be used alone or in combination without departing from the scope of the present disclosure. Other features, objects, and advantages of the systems and methods disclosed herein will become apparent from the detailed description and figures that follow.

Still other features and various advantages will become apparent upon reading the following detailed description in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout.
FIG. 1 is a block diagram illustrating one embodiment of an architecture of an interactive response system.
FIG. 2 is a flowchart illustrating one embodiment of a method of communication among a customer, the interactive response system, and a human interface.
FIG. 3A is a chart illustrating one embodiment of a customer/interactive response system interaction in the context of FIG. 2.
FIG. 3B is a computer screen user interface illustrating one embodiment for capturing customer intent and data in the context of FIG. 2.
FIG. 4A is a chart illustrating one embodiment of a customer/interactive response system interaction in the context of FIG. 2.
FIG. 4B is a computer screen user interface illustrating one embodiment for capturing customer intent and data in the context of FIG. 2.
FIG. 5A is a chart illustrating one embodiment of a customer/interactive response system interaction in the context of FIG. 2.
FIG. 5B is a computer screen user interface illustrating one embodiment for capturing customer intent and data in the context of FIG. 2.
FIG. 6 is a flowchart of processing an e-mail in the context of an interactive response system.
FIG. 7 is a block diagram illustrating one embodiment of an architecture of an interactive response system having a training subsystem.
FIG. 8 is a diagram of an exemplary processing flow 800 for ASR training.
FIG. 9 is a high-level block diagram illustrating an example of a computer 900 for use as any of the computers/processors referred to herein.
FIG. 10 is a timeline representation of recognizing the intent and data of an audio stream by different intent analysts.
FIG. 11 is a block diagram of an application interacting with an ASR proxy, showing the main components of the proxy.
FIG. 12 is a flow diagram illustrating the process and decision flow of the recognition decision engine for determining whether to use ASR, HSR, or both.
FIG. 13 is a flow diagram illustrating the process and decision flow of the results decision engine using a single ASR.
FIG. 14 is a flow diagram illustrating the process and decision flow of the results decision engine using multiple ASRs.
FIG. 15 is a flow diagram illustrating the process and decision flow of the results decision engine using both ASR and HSR.
FIG. 16 is a flow diagram illustrating the process and decision flow of the results decision engine using HSR.
FIG. 17 is a time-sequence diagram showing the response gap with automated recognition and human-assisted recognition.
FIG. 18 is a block diagram of an application and its interaction with an ASR proxy, showing the main components of the ASR proxy.
FIG. 19 is a flow diagram illustrating the process and decision flow of recognition decisions and results decisions using recognition statistics and system status information.
FIG. 20 is a flow diagram illustrating recognition decisions and results decisions using ASR statistics and system status.
FIG. 21 is a flow diagram illustrating recognition decisions and results decisions using timer ASR statistics and system status.
FIG. 22 is a flow diagram illustrating recognition decisions and results decisions using predictor recognition ASR statistics and system status.
FIG. 23 is a flow diagram illustrating a process for producing statistics.
FIG. 24 is a diagram showing examples of some recognition optimization criteria for producing statistics.

First, a description of the operation of the interactive response system and the associated machine learning systems and processes is provided with reference to FIGS. 1 to 10. Thereafter, a description of the operation of the ASR proxy system and its associated processes is provided with reference to FIGS. 11 to 16. FIGS. 17 to 24 and the corresponding discussion generally relate to processes for optimizing the ASR proxy, the goal being to optimize the combination of automated computer recognition and human-assisted recognition while improving the user experience. Note that, unless otherwise evident, the terms "intent" and "meaning" as used herein refer to the contextual reason corresponding to an utterance (e.g., having the system determine a caller's business intent to make a new flight reservation). In contrast, the term "recognize" and its derivatives are generally used herein for the process of converting sounds into their corresponding words.

A human-assisted decision engine is used to implement a multi-channel, multi-modal system. After an "interaction" has been routed to automation, and depending on the predictive results from automation, the engine decides whether to use HSR based on a set of predictive data and capacity factors, even before automated recognition competes. In some embodiments, the system automatically accelerates the "utterance" or "video" to further shorten the time gap between automation and human assistance.

Interpreting the response to a prompt can be viewed as two kinds of text analysis: information extraction and sense classification. Information extraction identifies, extracts, and normalizes the specific pieces of information needed to fill the slots of a business form, such as customer ID, telephone number, date and time, address, product type, and problem. Sense classification concerns identifying two additional types of information: meaning (intent) and response quality. Meaning (intent) relates to what kind of form needs to be filled in (billing, appointment scheduling, complaint, and so on). Response quality relates to the response itself (unclear, noise, Spanish rather than English, a request to speak with a live agent, and so on).

This response interpretation can be performed by intent analysis alone (pure HSR), by automation (ASR and intent classification), or by some combination of ASR and HSR. By using a confidence metric on the results of ASR automation to determine when ASR is producing reliable results, ASR automation can be traded off against HSR with limited or no loss of quality. This means that combining the two approaches in a proxy processing system can achieve greater throughput than HSR alone and can handle peak demand loads with a smaller team of intent analysts.
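
The confidence-based trade-off described above might be sketched as follows. This is a minimal illustration only: the `AsrResult` type, the `choose_recognizer` function, and the threshold of 0.85 are assumptions made for the example and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AsrResult:
    """Hypothetical container for one ASR output."""
    intent: str
    confidence: float  # assumed to lie in [0.0, 1.0]


def choose_recognizer(asr: Optional[AsrResult], threshold: float = 0.85) -> str:
    """Return "ASR" to accept the automated result, or "HSR" to ask an intent analyst.

    The threshold is an illustrative assumption, not a value from the source.
    """
    if asr is not None and asr.confidence >= threshold:
        return "ASR"
    # No ASR result, or a result that is not confident enough: fall back to a human.
    return "HSR"
```

Raising the threshold sends more utterances to intent analysts and favors quality; lowering it favors throughput, which is the trade-off the paragraph describes.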

FIG. 1 shows one embodiment of an architecture that connects an interaction platform 102 to an interactive response system 100 via an interaction router 101 (hereinafter "iRouter"). As shown in FIG. 1, the interaction platform 102 is connected to a customer 103 via a communication link 104. The interaction platform 102 is also connected to the interactive response system 100 at the iRouter 101 via a data link, which in this exemplary embodiment comprises a TCP/IP data link. The interaction platform 102 in this exemplary embodiment comprises a computer server. The exact configuration of the computer server varies by implementation, but it typically consists of a Pentium®-based server running an operating system such as Windows® or Linux®, using a voice board from a vendor such as Dialogic®. The interaction platform 102 can also be an email gateway or a web server. Thus, customer input enters the interactive response system 100 via a telephone or intercom, and text is entered via email or an interactive chat interface (for example, a web page or a stand-alone application such as Yahoo Messenger).

In the architecture of FIG. 1, the interaction platform 102 and the communication link 104 are each implemented, in various embodiments, using any of several different types of devices. The interaction platform 102 can be implemented by any device capable of communicating with the customer 103. For example, in one embodiment the interaction platform 102 is a telephony server in the interactive response system 100, in which case the customer is calling by telephone. The telephony server handles answering, transferring, and disconnecting incoming calls. The telephony server is also a storehouse for prerecorded audio clips, so it can play any welcome prompt and other audio clips as directed by the iRouter 101.

The telephony server according to this embodiment is assembled from off-the-shelf components, for example Windows as the operating system, a central processing unit such as a Pentium processor, and an Intel® Dialogic voice board. With this architecture, the communication link 104 is implemented by any means that provides an interface between the customer's telephone and the telephony server. For example, in various embodiments the communication link 104 is a dial-up connection or a two-way wireless communication link.

In another exemplary embodiment, the interaction platform 102 is a gateway server in the interactive response system 100. According to this exemplary embodiment, the customer interacts with the interactive response server by email, interactive text chat, or VoIP. The gateway server runs customized open-source email, WWW server software, or SIP. In addition, the gateway server according to this exemplary embodiment is designed to conduct email, interactive text chat, or VoIP transactions with customers while also sending data to and receiving data from other elements of the system. With this architecture, the communication link 104 is implemented by any means that provides an interface between the customer's computer and the gateway server. For example, in various embodiments the communication link 104 is a dedicated interface, a single network, a combination of networks, a dial-up connection, or a cable modem.

Although FIG. 1 shows only one interaction platform 102, those skilled in the art will appreciate, after reviewing this specification, that multiple interaction platforms 102 can be used in this system. With multiple interaction platforms 102, the interactive response system can communicate with customers via both voice and text data. Furthermore, multiple customer bases can be accommodated by dedicating an interaction platform 102 to each customer base. In this way, a workflow (described in detail below) is selected by determining which of the multiple interaction platforms 102 initiated the interaction.

In the architecture of FIG. 1, the iRouter 101 comprises software that controls the interactive response system 100. The iRouter 101 "owns" the interaction with the customer 103 from beginning to end by coordinating activity among the other components and managing the transaction. The iRouter 101 manages the interaction with the customer 103 according to one or more programmable scripts (called "workflows" in this exemplary embodiment). In general, a workflow comprises an interaction flow in which the path through the workflow depends on the intent input by the customer. Workflows are preprogrammed by system engineers and are advantageously "tweaked" periodically to improve customer satisfaction, speed, accuracy, and so on. According to this exemplary embodiment, the iRouter 101 is almost always "in charge" of selecting the next step or path in the workflow.

The iRouter 101 receives interaction input from the interaction platform 102 in the form of audio clips, email, text data, or another interaction type, depending on the form of customer communication. The iRouter 101 forwards this input to one or more human agents 105 (sometimes called "intent analysts" or "IAs"), to speech recognition engines, or to expert systems (collectively 108, also sometimes called "automatic speech recognizers" or "ASRs"), and uses the responses to advance its current workflow. When the input needs to be interpreted (or translated) by a human, the iRouter 101 directs the human agent's desktop software to display the appropriate visual context for the current workflow. Once the iRouter 101 understands the input, it advances through the workflow and directs the interaction platform 102 to respond appropriately to the customer 103.

In an exemplary embodiment in which the interaction platform 102 comprises a telephony server, the iRouter 101 sends sound clips to be played to the customer, sends text-to-speech clips, or both. Alternatively, the interaction platform 102 can store sound clips, have text-to-speech capability, or both. In this embodiment, the iRouter directs the interaction platform 102 as to what to play to the customer and when.

In this exemplary embodiment, the iRouter 101 comprises a networked, off-the-shelf commercial processor running an operating system such as Windows or Linux. Furthermore, the iRouter 101 software includes a modified open VoiceXML (VXML) browser and VXML scripts that incorporate objects suited to the particular application. Those skilled in the art will understand how to construct these objects after reviewing this specification.

According to the exemplary architecture of FIG. 1, the interactive response system 100 includes at least one pool of human agents 105. A pool of human agents 105 is often located at a contact center site. According to this embodiment of the invention, the human agents 105 use specialized desktop software specific to the system 100 (described further below with respect to FIGS. 3B, 4B, and 5B), which presents a collection of possible intents on their screens (their user interface), together with the history or context of the customer interaction up to that point. One or more human agents 105 interpret the input and select the appropriate customer intent, data, or both.

For telephone interactions, the human agents 105 wear headphones and listen to sound clips ("utterances") streamed from the telephony server 102 at the direction of the iRouter 101. According to one aspect of the invention, no single human agent 105 handles the entire transaction for the customer 103. Instead, a human agent 105 handles those portions of the transaction that the workflow designer has designated as requiring human interpretation of the customer 103's utterance. The iRouter 101 can send the same customer 103 interaction to any number of human agents 105 and can distribute portions of a given interaction among many different human agents 105.

According to an exemplary embodiment of the invention, the human agents 105 are preferably off-site. Furthermore, human agents 105 may be located in diverse geographic areas of the world, such as India, the Philippines, and Mexico. Human agents 105 may be grouped together in a building or may work from home. In applications that require round-the-clock human agent support, human agents 105 can be located around the world so that each can work during suitable business hours.

The interactive response system 100 of the invention employs custom human agent application software. The human agents 105 use a custom application developed in Java and run on workstations on a standard call center computer network. Generally speaking, the interactive response system 100 applies human intelligence to interpreting the customer 103's input as "intent" (what the customer wants) and data (any data needed to determine what the customer wants). In this exemplary embodiment, interpretation typically involves selecting the most correct interpretation of what was said from a list of choices. In an alternative embodiment, computer-aided data entry (for example, autocompletion of text entry or email address entry) is used in conjunction with agent processing.

The workflow server 106 of the invention, an off-the-shelf component, is an archive of the workflows used by the interaction router. In one embodiment, the workflow server 106 is built on off-the-shelf hardware using a commercially available processor running a standard server operating system; in this exemplary embodiment, the workflow documents are written in XML. The workflow server 106 maintains a compilation of business rules that govern the behavior of the iRouter 101.

The interactive response system 100 employs a workflow designer, which business analysts or process engineers use to lay out workflows. A workflow serves as the map that the iRouter 101 follows in a given interaction, whether with speech recognition or with human agents. The workflow "steers" the iRouter 101 along a path through the workflow in response to customer input. The place in the workflow, together with the data collected up to that point, is called the "context."
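
As a rough illustration of a workflow acting as a map and of a "context" pairing a place in that map with the data gathered so far, consider the sketch below. The specification stores workflows as XML documents; the Python dictionary, the node names, and the `Context` and `advance` names here are hypothetical stand-ins chosen only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical workflow: each node maps an interpreted intent to the next node.
WORKFLOW: Dict[str, Dict[str, str]] = {
    "greeting": {"CFT": "confirm_flight_time", "AGENT": "transfer_to_company_agent"},
    "confirm_flight_time": {"DONE": "closing"},
}


@dataclass
class Context:
    """A place in the workflow plus the data collected up to that point."""
    node: str
    data: Dict[str, str] = field(default_factory=dict)


def advance(ctx: Context, intent: str,
            new_data: Optional[Dict[str, str]] = None) -> Context:
    """Follow the workflow from the current node according to the interpreted intent.

    An intent that the current node does not list leaves the position unchanged.
    """
    next_node = WORKFLOW.get(ctx.node, {}).get(intent, ctx.node)
    merged = {**ctx.data, **(new_data or {})}
    return Context(node=next_node, data=merged)
```

For example, `advance(Context("greeting"), "CFT", {"from": "Chicago", "to": "London"})` yields a context positioned at `confirm_flight_time` that carries the two cities.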

The workflow designer builds instructions for the human agents 105 into the workflow to guide their interpretation of intent. The workflow designer may include a version of the Eclipse® software development environment customized to focus on building XML documents. However, one skilled in the art will be able to develop a workflow designer after reviewing this specification.

The performance and interaction archive 107 of the invention comprises a database that can be maintained on any common computer server hardware. The performance and interaction archive 107 contains both archival data of system transactions with customers 103 (that is, a repository of sound clips, emails, chats, and the like from interactions with customers 103) and performance data for the human agents 105.

This exemplary embodiment employs "reporter" software to generate statistics about groups of interactions or to display performance rankings of human agents 105. The reporter software can also reconstruct an interaction with a customer 103 from the sound clips, emails, or chat text that made up the customer 103's contact, as stored in the interaction archive 107. The reporter software is a series of simple scripts and can run on any common server hardware.

This exemplary embodiment also includes manager/administrator software, which is typically run from the same station as the reporter software. The manager/administrator software sets operating parameters for the interactive response system 100. Such operating parameters include, but are not limited to, business rules for load balancing, uploading changes to workflows, and other administrative changes. In one particular embodiment, the manager/administrator software is a small custom Java® application running on a standard call center computer workstation.

The support systems 108 consist of a number of databases and customer proprietary systems (also including off-the-shelf automatic speech recognition (ASR) software such as Nuance®) that can be used in responding to customer 103 requests. For example, the support systems 108 may include a database of customer information or a knowledge base. In this exemplary embodiment, the speech recognition software is an off-the-shelf component used to interpret customer 103 utterances. The support systems 108 may also include text-to-speech capability, which is often off-the-shelf software that reads text aloud to the customer 103.

The company agents 109 of the invention consist of human agents who handle customer 103 requests that the workflow refers to them. For example, if the customer 103 intends to obtain assistance with a company matter and an outsourced human agent 105 identifies that intent, the workflow can direct the interactive response system 100 to transfer the call to a company agent 109.

The elements of the interactive response system 100 communicate over a TCP/IP network in this exemplary embodiment. Communication is driven by the workflow that the iRouter 101 follows. A "database" in this embodiment can be a flat-file database, a relational database, an object database, or any combination of these.

Turning now to FIGS. 2 through 5, these figures show an example of how information is retrieved and processed by the interactive response system 100 when a customer interacts with it by telephone. The example shown in FIG. 2 assumes that all required hardware, software, networking, and system integration are complete, and that a business analyst has laid out the possible steps in a customer interaction using the graphical workflow designer. The business analyst has also written the text for anything the interactive response system might say to the customer 103. This includes, but is not limited to, the initial prompt (for example, "Thank you for calling. How can I help you today?"), responses to the customer, requests for additional information, "stutter speech" (sounds sent to the customer while the iRouter 101 is determining a response), and closing statements. Either text-to-speech software or voice talent records each piece of server-side speech written by the business analyst. This workflow is loaded into the interactive response system 100, where it is available to the iRouter 101.

As shown in block 201, the interaction begins when the customer 103 calls the company's customer service telephone number. The interaction platform 102 (in this case, a telephony server) answers the call and, as shown in block 202, retrieves the appropriate workflow stored in the workflow database based on either (1) the caller's ANI/DNIS information or (2) other business rules (for example, the line or trunk on which the call arrived). The telephony server plays the appropriate welcome prompt as shown in block 203, and the customer responds to the prompt (block 204).
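
The workflow retrieval of block 202 could look something like the following sketch. Everything concrete in it is assumed for illustration: the lookup tables, the telephone number, the trunk identifier, the workflow names, and the choice to consult DNIS before the trunk rule. The specification says only that either source of information may be used.

```python
from typing import Optional

# All table contents below are hypothetical examples.
WORKFLOW_BY_DNIS = {"8005550100": "airline_customer_service"}
WORKFLOW_BY_TRUNK = {"trunk-7": "airline_baggage_claims"}
DEFAULT_WORKFLOW = "general_greeting"


def select_workflow(dnis: Optional[str], trunk: Optional[str]) -> str:
    """Pick a workflow name for an incoming call.

    Assumed precedence: the dialed number (DNIS) first, then the trunk the call
    arrived on, then a default workflow.
    """
    if dnis in WORKFLOW_BY_DNIS:
        return WORKFLOW_BY_DNIS[dnis]
    if trunk in WORKFLOW_BY_TRUNK:
        return WORKFLOW_BY_TRUNK[trunk]
    return DEFAULT_WORKFLOW
```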

For example, a fictional airline, Inter Air, provides customer service via an interactive response system according to a call center embodiment of the invention. The interaction platform 102 is therefore a telephony interface, and the iRouter 101 selects a workflow appropriate to Inter Air.

The illustrative workflow of FIG. 3A shows the first point, or context, in the workflow. There has been no customer utterance, so there is no intent or data to capture (and respond to). The only response is a greeting and a prompt for customer input.

Processing proceeds to box 204 in the flowchart of FIG. 2. The telephony server begins digitizing the customer's spoken input and connects to the iRouter. At this point, the workflow or business rules determine whether the interactive response to the customer needs to be handled by a human agent or by speech recognition software. That is, the iRouter selects the appropriate workflow for the call from the workflow repository and conducts the conversation with the customer according to the workflow rules.

To interpret the customer's words, the iRouter 101, as appropriate, either uses ASR from the support systems or has the customer's audio streamed to human agents 105 in a contact center, as shown in block 205. If human agents 105 are required by the workflow, the iRouter 101 identifies available human agents by applying a load-balancing algorithm, as shown in block 207, triggers a pop-up on their screens (as shown in the initially blank pop-up screen of FIG. 3B), presents several selectable intent options, and begins streaming the customer audio to the identified human agents. As will occur to those skilled in the art given this disclosure, this load balancing includes identifying more or fewer human agents to interpret the utterance at various times, based on any of a variety of factors. As shown in blocks 210 and 211, the human agents hear the customer's utterance in their headphones, and the computer software prompts them for an interpretation of the utterance.

According to the exemplary workflow of FIG. 4A, the customer utterance heard by one or more human agents is "I want to confirm my flight from Chicago to London this afternoon." As shown in FIG. 4B, the agent's screen shows the current context (or point in the workflow). In this illustrative screenshot, there are 12 possible requests (including unanswerable and end) that a human agent can select. In operation, there are hundreds of possible interpretations available to agents. This wide variety of choices gives agents interpretive flexibility, which allows the iRouter to jump around in its workflow according to the interpreted intent. Thus, according to one aspect of the invention, the iRouter can respond appropriately even if the customer changes the subject midstream.

In each case, each agent selects what he or she feels is the most suitable interpretation of the customer utterance in the current context of the workflow. In the example of FIG. 4B, the human agent selects "CFT" (confirm flight time) and types in, or selects from drop-down menus, the departure and arrival cities (or other preprogrammed information that the customer might speak).

Note that in blocks 208 and 209, the human agent can decide to apply acceleration to the customer audio clip received at the station in order to compensate for any response delay (typically due to lag time in application setup, that is, the time it takes for the human agent's desktop software to accept the streaming audio and display the appropriate workflow). Network latency may be around 0.2 seconds, and application delay may be greater, in the range of 1+ seconds. To compensate for application delay, the interactive response system accelerates the voice clip (though not to the point of perceptible distortion). The aim is to strive for a more "real-time" conversational interaction so that the customer does not experience a noticeable delay while waiting for a response. Acceleration is applied to the speech as it streams in from the telephony server. Acceleration cannot overcome the inherent latency of the link, but it allows the human agent to "recover" any application setup time and to reduce the amount of lag in the interaction, ideally down to the limit imposed by latency in the network. Acceleration is optional, however: a novice agent may need slower playback, whereas a more experienced agent can apply acceleration.
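
The arithmetic behind "recovering" setup time can be sketched under a simplified model that the specification does not itself state: the utterance arrives live at real-time rate, playback to the agent starts `setup_delay_s` seconds late, and playback runs at a constant `speed` multiple of real time. The function names and the model are assumptions for illustration.

```python
def catch_up_time(setup_delay_s: float, speed: float) -> float:
    """Wall-clock seconds of accelerated playback needed to catch up with live audio.

    Each second of playback at `speed` consumes `speed` seconds of audio while
    only one new second arrives, so the backlog shrinks by (speed - 1) per second.
    """
    if speed <= 1.0:
        raise ValueError("playback must be faster than real time to catch up")
    return setup_delay_s / (speed - 1.0)


def lag_at_end(setup_delay_s: float, utterance_s: float, speed: float) -> float:
    """Seconds by which the agent finishes hearing the utterance after the customer stops."""
    if speed <= 0.0:
        raise ValueError("speed must be positive")
    # Time saved by playing the whole utterance at `speed` instead of real time.
    recovered = utterance_s * (1.0 - 1.0 / speed)
    # The agent can never get ahead of live audio, so the lag is floored at zero.
    return max(0.0, setup_delay_s - recovered)
```

Under this model, a one-second application delay played back at 1.25 times real time is recovered after about four seconds of playback, so a three-second utterance still ends with some residual lag while a ten-second utterance does not. Network latency is outside the model and remains as a floor, as the paragraph notes.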

At test 213, the iRouter evaluates the accuracy of the customer audio interpretation in real time and updates each agent's speed/accuracy profile. At block 214, the iRouter processes the interpretation and performs the next step in the workflow (for example, a database lookup based on the input data), and then forwards the appropriate response to the customer via the telephony server (218), provided the interpretation is deemed accurate. If the iRouter determines that the interpretation is accurate, it directs playback of the response from the telephony server to the customer, based either on the interpretation of the speech recognition software or on applying a key algorithm to the responses of one or more human agents. In this example, the response is given in the last block of screen 2 of FIG. 4A.

To determine accuracy, the iRouter compares the interpretations of two human agents and, if they do not agree, plays the customer audio clip to a third human agent for a further interpretation (that is, a "majority rule" determines which response is accurate). Other business rules may also be used to determine the accurate interpretation. For example, the interpretation from the agent with the best accuracy score may be selected. Alternatively, one of the interpretations may be selected and played back to the customer ("I understood you to say ..."), and the customer's response determines whether that interpretation was correct. Interpretations may also be selected using known data: for example, two interpretations of an email address can be compared against a database of customer email addresses, or only one of two interpretations of a credit card number will pass a checksum algorithm.
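Two of these rules, majority voting and checksum filtering, can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and the Luhn algorithm is assumed as the checksum because it is the one commonly used for payment card numbers; the text itself does not name a specific checksum.

```python
from collections import Counter
from typing import Optional, Sequence


def majority(interpretations: Sequence[str]) -> Optional[str]:
    """Return the interpretation given by more than half of the agents, if any."""
    if not interpretations:
        return None
    value, votes = Counter(interpretations).most_common(1)[0]
    return value if votes > len(interpretations) / 2 else None


def luhn_ok(number: str) -> bool:
    """Luhn check (assumed checksum): double every second digit from the right."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def pick_card_number(candidates: Sequence[str]) -> Optional[str]:
    """Accept a candidate only when exactly one passes the checksum."""
    valid = [c for c in candidates if luhn_ok(c)]
    return valid[0] if len(valid) == 1 else None
```

When two agents disagree, `majority` returns `None`, which corresponds to the point at which a third agent would be asked for a further interpretation.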

The interactive response system allows virtually any number of human agents to handle the same customer interaction at once. That is, the system may have two agents listening during busy periods, or seven human agents listening during slower periods. In addition, during periods of high call volume, accuracy can be traded for fast response times by dropping the "double-check" rule. Agents who have been assigned a high trust rank on the basis of their speed/accuracy profile may be asked to work without double-checking. Besides trading accuracy for quicker system availability, this arrangement keeps an uninterrupted stream of audio clips flowing to each agent, which reduces human agent idle time.

Returning to the flowchart of FIG. 2, either the customer responds again, as seen in block 204, or, as shown in block 215, the call is transferred (if so directed by a step in the workflow or by a business rule) or the customer ends the call. If the interpretation is deemed inaccurate in block 213, the iRouter 101 plays a stalling speech to the customer (block 216), sends the audio clip to an additional human agent for another interpretation (block 217), and then re-evaluates the accuracy.

The iRouter manages the interaction with the customer through to call completion, using the workflow as its guide. The iRouter may stream customer utterances to human agents for interpretation at many points during the call. When the call concludes, a snapshot of the customer interaction is saved in the archive database. The human agents' speed/accuracy profiles are continually updated and maintained.

If no human intervention is needed to interpret the customer's request, the ASR interprets the audio clip and the iRouter determines the appropriate response, as shown in blocks 206 and 214.

Continuing the Interair example, the captured customer utterance, as seen in FIG. 5A, contains two requests: a food inquiry and an entertainment inquiry. According to another aspect of the invention, the human agent captures two intents: meals and movies. There is no relevant data to enter, because the interactive response system already knows the flight information from the data previously entered in FIG. 4B (this data is visible in FIG. 5B). As seen in FIG. 5B, the human agent enters "General" and "Meals" from the on-screen display of possible intents. The human agent also enters "Movies". As seen in FIG. 5A, the interactive response system provides the appropriate response. As seen in FIG. 5B, if the customer requests further information about the meal or the movie, such as "What meal is served?", "Are there special meals?", or "What is the movie rated?", the appropriate human agent interpretation options are located on the computer screen.

FIG. 6 shows an example of how information is retrieved and processed by the interactive response system when a customer interacts via electronic mail (commonly known in the art as email). As shown in block 601, the interaction begins with the customer sending an email to the company's customer service email address. The interaction platform (in this exemplary embodiment, a gateway server) opens the email and, as shown at 602, retrieves the appropriate workflow stored in the workflow database based on either (1) the customer's to/from information or (2) other business rules. The gateway server sends an appropriate response acknowledgment, as shown at 602. As shown in block 603, the iRouter 101 identifies available human agents to handle the email by applying a load-balancing algorithm, triggers a pop-up on their screens showing possible intents for interpretation, and sends the email content to one or more human agents. The human agents interpret the email, as shown in blocks 604 and 605. In test 606, the iRouter 101 evaluates the accuracy of the customer email interpretation in real time and updates each agent's speed/accuracy profile; after this test, the iRouter 101 processes the interpretation and performs the next step in the workflow accordingly. Finally, as seen in block 607, the iRouter 101 forwards an appropriate email response to the customer through the gateway server (if the interpretation is deemed accurate). As shown in block 608, the email is archived in the appropriate database. If the interpretation is deemed inaccurate, the iRouter 101 sends the email to another human agent for another interpretation (block 609) and re-evaluates the accuracy. The iRouter 101 manages the interaction with the customer through to the email response, using the workflow as its guide.

The foregoing discussion of the interactive response system and its constituent processes with respect to FIGS. 1-6 includes the operation of one or more speech recognition and related subsystems 108. In practice, implementing the IVR system 100 requires that such subsystems 108 be able to recognize a significant portion of customer utterances, so as to minimize the need for human interaction.

Referring now to FIG. 7, a training subsystem 710 is included as part of the IVR system 100. In operation, the training subsystem 710 selectively provides machine learning capabilities to the real-time ASRs in the subsystems 108, so that they can adapt very quickly to new or changed customer interactions. For example, when the IVR system 100 is first installed for a company, the general capabilities of an embedded ASR may not be very usable for actual customer interactions, particularly if those interactions contain a lot of industry-specific terminology (for example, an electrician calling to order a ground fault circuit interrupter would typically use the acronym "GFCI", which few ASRs would readily recognize). Similarly, when a new offering becomes available, existing ASR capabilities may begin to fail even though they were previously successful (for example, an ASR that correctly identified "iPod (registered trademark)" in past use may begin to fail once another product with a similar name, such as "iPad (registered trademark)", is introduced). Such changes may be infrequent in some applications but may occur regularly in others. For example, an application for selling rock concert tickets needs to adapt regularly to new customer requests for band names.

In one embodiment, training is performed based on an indicated need for such training. For an existing system in which ASR accuracy is well above an acceptability threshold, training may take place only occasionally, if at all. In such a case, training might be performed, for example, only during periods of extremely low call volume, when the intent analysts (IAs) 105 would otherwise be relatively idle. When the system is new, or whenever ASR success falls below an acceptable limit, more training may be called for, and the training subsystem 710 is therefore active more often.

The non-real-time training ASR 711 of the training subsystem 710 receives as input customer utterances from the iRouter 101 and the corresponding intents from the IAs 105. In practice, multiple training ASRs 711 may be used, as described below.

As with real-time production processing, processing for non-real-time training includes input from a single IA in some embodiments and input from multiple IAs in others. Differences in the intents selected by different IAs are very helpful in training an ASR, because they may indicate particularly nuanced utterances that call for considerable additional training. In the simplest form, where a business intent may have a small grammar with very few options, such as "yes" or "no", and where the ASR comes with a pre-packaged understanding of the utterances "yes" and "no", training may consist of building a statistical model that can be used for grammar tuning. In more complex training, domain knowledge is used to assist the ASR's word recognition in order to build a statistical language model of the utterances that are likely to be spoken.

In a preferred embodiment, the IVR system 100 is implemented using multiple available real-time ASRs in the support systems 108. In practice, each ASR is found to have strengths and weaknesses. Success in particular areas can be used by the iRouter 101 to determine which ASR to use in a particular circumstance, and by the training subsystem 710 to determine which ASR could benefit from training in a particular circumstance. Currently available ASRs include those from Carnegie Mellon University (Sphinx), Nuance, Dragon, Loquendo (registered trademark), Lumenvox, AT&T (registered trademark), SRI International, Nexidia, Microsoft (registered trademark) and Google (registered trademark). Because only a select few ASRs are available at no cost (for example, under open-source licenses), economic considerations may limit the number of ASRs included in the support systems 108. Because the iRouter 101 can selectively route production requests to an ASR that is expected to perform well in any specific context, and because the training subsystem 710 can likewise selectively train the real-time ASRs based on the expected improvement in their performance, it will often be advantageous to select a group of ASRs whose performance characteristics are somewhat orthogonal to one another. In that manner, one ASR can be expected to make up for the weaknesses of another. For example, an ASR optimized for processing telephone speech may have performance characteristics quite different from those of an ASR designed for speech from dictation equipment.

To increase the accuracy of the real-time ASRs used in the IVR system 100, the training subsystem 710 facilitates machine learning by providing the real-time ASRs with training that is specific to the meaning of each received utterance, based on the non-real-time operation of the training ASR 711.

In general, ASRs are trained in several different respects. First, an ASR must be able to classify an audio stream, and portions of that stream, into components that can help lead to recognition of the words being spoken. Typically, this involves identifying, within the audio stream, a set of similar sound classes known as "phones", sound transitions or combinations known as "diphones", and potentially more complex waveform portions commonly referred to as "senones". Utterances are generally divided wherever a period of silence is detected. Features are derived from an utterance by dividing it into frames (such as 10-millisecond time frames) and extracting, within each frame, various characteristic aspects of the audio, such as whether amplitude and frequency are increasing, constant or decreasing. In the Sphinx ASR available from Carnegie Mellon University, 39 features are extracted to represent speech as a "feature vector". Typically, ASR engines come with this aspect of their recognition fixed, and users of such systems cannot change which features are analyzed or how they are analyzed.

An ASR uses various models to proceed from the raw audio waveform to a prediction of the word corresponding to the utterance. An acoustic model determines the most probable features/feature vectors for received senones. A phonetic model maps phones to words, with the words coming either from a fixed dictionary or from a vocabulary (or "grammar") derived by machine learning. A language model restricts candidate word choices based on some context, such as previously recognized words. ASRs typically use a combination of these models to predict which words correspond to an utterance. The training in the embodiments discussed below focuses on the latter two models, the phonetic model and the language model, but the concepts addressed here could readily be applied to other models used in speech recognition.

In many cases, ASR training can be accomplished more effectively by using context, either from previously recognized words or, for processing that is not real-time, from words recognized later in the same customer discourse. Such training is described below.

Turning first to the phonetic model, consider the user utterance "I would like to fly roundtrip between Boston and San Diego." An off-the-shelf ASR may have some difficulty recognizing some of these words across a variety of speakers. For example, in pronouncing the word "roundtrip", some speakers may elide the "d" and "t" consonants into a single sound ("rountrip"), while others may pronounce them separately, as if they were the two words "round" and "trip".

In one embodiment, the training subsystem 710 provides machine learning to the non-real-time training ASR 711 by addressing each of these issues. First, the training subsystem 710 selects a target vocabulary based on the business meaning corresponding to the utterance, as determined by the IA 105 when the utterance was initially received. In this case, the IA likely selected "new reservation" as the business meaning. The word "roundtrip" may have been one of 40,000 words in a general grammar, with a very low statistical incidence, but it may be one of only a thousand words in a grammar specific to the "new reservation" intent, with a much higher statistical incidence. Thus, by changing the applicable grammar, the training subsystem 710 significantly raises the probability that the training ASR 711 will accept the word "roundtrip" as what was spoken, even if the feature vectors deviate considerably from the standardized model of that word. Furthermore, as additional utterances of "roundtrip" become associated with the "new reservation" intent, those utterances will likely match at least some of the previously recognized instances in which "roundtrip" was spoken more closely. Over time, therefore, both the likelihood of the word "roundtrip" occurring within the "new reservation" intent and the variations in pronunciation of the word lead to two results: (a) greater certainty in recognizing the word, which can be propagated to other grammars that contain the same word, such as the grammar associated with the "cancel reservation" intent; and (b) better prediction of business intent, through refined statistics on how often the word is associated with a particular intent.
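The effect of switching from a general grammar to an intent-specific grammar can be illustrated by treating the grammar as a prior over words and combining it with an acoustic score. This is a minimal sketch only: the function name, the candidate words and every probability below are invented for illustration and do not come from any particular ASR.

```python
from typing import Dict


def posterior(acoustic: Dict[str, float], prior: Dict[str, float]) -> Dict[str, float]:
    """Combine per-word acoustic likelihoods with a grammar prior and normalize."""
    scores = {w: acoustic[w] * prior.get(w, 0.0) for w in acoustic}
    total = sum(scores.values())
    if total == 0:
        return scores
    return {w: s / total for w, s in scores.items()}


# Hypothetical acoustic likelihoods for a slurred "roundtrip".
acoustic = {"roundtrip": 0.2, "round": 0.5, "ground": 0.3}

# General grammar: each word is one of 40,000 equally likely entries.
general_prior = {w: 1 / 40000 for w in acoustic}

# "New reservation" grammar: about 1,000 words, with "roundtrip" far more frequent.
intent_prior = {"roundtrip": 0.02, "round": 0.001, "ground": 0.001}

general = posterior(acoustic, general_prior)
specific = posterior(acoustic, intent_prior)
best_general = max(general, key=general.get)
best_specific = max(specific, key=specific.get)
```

Under the general grammar the acoustically closest word, "round", wins. Under the intent-specific grammar "roundtrip" is selected even though its acoustic score is the lowest of the three.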

Returning to the example utterance above, a fast-talking speaker may blur the distinction between "Boston" and the following word "and", failing to articulate all of the sounds, so that the training ASR 711 may be trying to analyze the sound "Bostonan". Similarly, some speakers may pronounce the city name "San Diego" so that it sounds more like "Sandy A-go". Here again, selecting the "new reservation"-specific grammar rather than a generalized grammar will likely raise dramatically the statistical likelihood that "Boston" and "San Diego" are recognized with confidence. As a further refinement, the training subsystem 710 makes iterative passes through the utterances of the entire user discourse to improve training even more. In the example above, the caller may later in the discourse say "Boston" at the end of a sentence, in a manner that is readily recognized by the training ASR 711. That speaker's acoustic signature for "Boston" is included in the ASR's mapping, so that on a second pass the same speaker's "Bostonan" utterance is considered a better match for "Boston" than before. Similarly, the speaker may say "San Diego" a second time with more distinction between "San" and "Diego", thereby providing learning that makes it more likely that an iterative recognition attempt will successfully recognize the first, blurred utterance. For long customer discourses, multiple iterations can lead to considerable improvement in overall recognition, as the caller's voice characteristics become better understood through the words that the system can recognize.

Referring now also to FIG. 10, in one embodiment the actual points in time at which the intent analyst performed recognition are used to break the audio stream into separate utterances for recognition (for example, by the training ASR 711). Specifically, the recognition time of the utterance intent "I want to take a flight from" (1001, 1004), the recognition time of the data portion "Boston" (1002, 1005), and the recognition time of the data portion "San Diego" (1003, 1006) are all sufficiently distinct that the time frames themselves can be used to facilitate breaking the audio into separate utterances for recognition. In some cases, the IA may provide recognition before (or after) the utterance is complete (for example, as shown at 1003 in FIG. 10, "San Diego" is recognized by the IA before the final "o" sound), so in such cases the time frame is adjusted to end at a suitable pause after (or before) the recognition provided by the IA. The number of possible business intents and the typical words used to express them can be used to narrow the intent recognition grammar, and the type of data collected (for example, city names) can be used to narrow the data recognition grammar.
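The adjustment of segment boundaries to nearby pauses can be sketched as follows. This is a hedged illustration rather than the method of FIG. 10: it assumes that pauses have already been detected elsewhere and are supplied as (start, end) pairs in milliseconds, it snaps each IA recognition time to the midpoint of the nearest pause, and all function names are hypothetical.

```python
from typing import List, Tuple

Pause = Tuple[int, int]  # (start_ms, end_ms) of a detected silence


def snap_to_pauses(recognition_ms: List[int], pauses: List[Pause]) -> List[int]:
    """Move each IA recognition time to the midpoint of the nearest pause."""
    boundaries = []
    for t in recognition_ms:
        start, end = min(pauses, key=lambda p: abs((p[0] + p[1]) // 2 - t))
        boundaries.append((start + end) // 2)
    return boundaries


def split_utterances(boundaries: List[int],
                     audio_start_ms: int = 0) -> List[Tuple[int, int]]:
    """Turn ordered boundaries into consecutive (start, end) audio segments."""
    segments = []
    previous = audio_start_ms
    for b in sorted(set(boundaries)):
        if b > previous:
            segments.append((previous, b))
            previous = b
    return segments
```

For example, an IA recognition logged at 3400 ms, before the speaker has finished the last word, is moved forward to a pause centered at 3750 ms, so the final segment contains the whole word.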

Turning to the language model, the training subsystem 710 again uses business intent to assist with training. For example, if an IA has indicated a business intent of "new reservation", it is statistically very likely that at least one instance of the word "and" in the utterance will be preceded by one city name and followed by another. Likewise, if the words "from" or "to" are recognized, it is statistically very likely that a city name follows them. In contrast, if the business intent determined by the IA is "seat assignment", those same words "from" and "to" will rarely correlate with an adjacent city name, but will instead correlate with a nearby number-letter pair (for example, "I would like to change from seat 39B to seat 11A.").
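Intent-conditioned statistics of this kind can be sketched as a bigram count table keyed by business intent. This is a minimal illustration under stated assumptions: the class name, the intent labels and the placeholder tokens `<CITY>` and `<SEAT>` are hypothetical, and a production language model would add smoothing and longer contexts.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


class IntentBigramModel:
    """Counts which token follows which, separately for each business intent."""

    def __init__(self) -> None:
        self.counts: Dict[Tuple[str, str], Counter] = defaultdict(Counter)

    def train(self, intent: str, tokens: List[str]) -> None:
        for prev, nxt in zip(tokens, tokens[1:]):
            self.counts[(intent, prev)][nxt] += 1

    def prob(self, intent: str, prev: str, nxt: str) -> float:
        following = self.counts.get((intent, prev))
        if not following:
            return 0.0
        return following[nxt] / sum(following.values())


model = IntentBigramModel()
model.train("new_reservation", ["fly", "from", "<CITY>", "to", "<CITY>"])
model.train("seat_assignment",
            ["change", "from", "seat", "<SEAT>", "to", "seat", "<SEAT>"])
```

After this training, "from" is followed by a city name under "new_reservation" and by "seat" under "seat_assignment", which is the contrast drawn in the paragraph above.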

Such language model training also allows ready adaptation to users' changing phrasing. For example, if an airline begins service to England, it may suddenly start receiving requests for the same business intent that use different language than before. The earlier example, "I would like to fly roundtrip between Boston and San Diego.", might be spoken by a British customer as "I would like to book a return trip between Boston and London." Initially, the word "book" would not appear with high probability in the "new reservation" grammar, but the statistical usage of this word in that grammar quickly increases with additional British customers. Likewise, use of the term "return" changes with the addition of a British customer base, and the "new reservation" grammar is adjusted accordingly to recognize it.

The training subsystem 710 also adjusts the statistics for recognition candidates based on a combination of business intent and adjacent recognized words in the discourse. Consider an example in which the business intent has been determined to be "new reservation" and, initially, only one utterance in the user's discourse cannot be recognized at a usable confidence level. If the discourse is recognized as having contained only one city name, there is a very high probability that the unrecognized utterance is another city name, and a higher probability still that it is the name of a city served by the airline using the system. Changing the probabilities for candidate words within the grammar and performing partial recognition may drop enough candidate words from further consideration that only one candidate, likely a city name, reaches a usable level of certainty. Machine learning then incorporates that particular user's pronunciation of the city into the ASR's model, so that subsequent instances of similar utterances are recognized more readily.

Maintaining a separate grammar for each allowable business intent makes it easier for the training subsystem 710 to teach ASRs more rapidly than would otherwise be possible. For example, the utterances "book", "notebook" and "Bucharest" have strong phonetic similarities. Determining which of these meanings corresponds to a user's utterance is greatly improved by taking the business intent into account. For example, if the business intent is "lost and found", then "book" (in its noun sense) and "notebook" (as in "notebook computer") will appear with much higher likelihood than in other contexts. If the business intent is "new reservation", then "book" (in its verb sense) will also appear with very high likelihood. Similarly, if the business intent is "new reservation", then "Bucharest" will appear with higher likelihood than if the business intent were, for example, "seat selection".

Once the training ASR 711 has itself been sufficiently trained, correlations between business intents and language models can be developed in a very robust manner. For example, an illustrative portion of a mapping for similar-sounding words may be as follows.

The training ASR 711 is particularly well suited to developing language model statistics because it has two advantages over the real-time ASRs of the support systems 108. First, because it is not used for production operations, it does not need to operate in real time. It can therefore use more complex recognition algorithms that, at least on relatively modest computing platforms, could not perform recognition quickly enough for real-time processing. This allows the training ASR 711 to recognize utterances that the real-time ASRs in the support systems 108 would not be able to recognize. Second, the training ASR 711 can use not only information that precedes an utterance in the customer discourse (a priori information) but also information that follows it (a posteriori information). It can therefore wait until all utterances in an interaction have been analyzed and then make multiple passes at recognition, with later iterations presumably more likely to succeed. As noted above, an initial user utterance that sounds like "Bostonan" can be recognized much more readily after a second utterance of "Boston".

Over time, training ASR 711 builds a set of statistics relating to the language elements used with each associated business intent. In one embodiment, multiple training ASRs 711 are used, each contributing to the overall statistics. In some embodiments, the statistics include a measure of certainty regarding recognition. This measure is based on multiple instances of recognition by a single training ASR 711, on agreement among multiple training ASRs 711, or on both.
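One simple way to picture an agreement-based certainty measure is sketched below. The specification does not define the measure; the function name `agreement_certainty` and the choice of a plain vote fraction are assumptions made only for illustration.

```python
from collections import Counter
from typing import List, Tuple


def agreement_certainty(hypotheses: List[str]) -> Tuple[str, float]:
    """Return the most common hypothesis and the fraction of votes it received.

    Each entry may come from a different training ASR, or from a separate
    recognition pass by the same training ASR.
    """
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    counts = Counter(h.strip().lower() for h in hypotheses)
    best, votes = counts.most_common(1)[0]
    return best, votes / len(hypotheses)


best_text, certainty = agreement_certainty(["Boston", "boston", "Austin"])
```

Here two of three hypothetical recognizers agree, so the certainty for "boston" is two thirds. A deployed system would more likely weight each vote by the recognizer's own confidence score.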

The statistics developed in this manner are usable by any of the real-time ASRs in support system 108. Each of the various ASRs available for real-time recognition in the support system typically has its own mechanism for training and a corresponding specification for how language models may be input to that mechanism. In a preferred embodiment, training subsystem 710 formats the statistics it develops for each of the ASRs in support system 108, so that each of those ASRs can make use of the statistics generated by training subsystem 710. In practice, ASRs vary widely in the mechanisms they support for training. Training algorithm 712 is therefore readily configurable to collect, format and provide training data in a manner appropriate to each existing ASR, as well as to each new ASR that may be added to support system 108. As the performance of a real-time ASR improves with training, the quality of its recognition may allow it to replace the function of an IA 105 in processing 210, 211.

Training subsystem 710 also works in conjunction with the capabilities of each ASR to ensure that ASR training is put to the best possible use in IVR system 100. For example, an ASR may support a threshold determination of when sufficient portions of an utterance are recognized to be usable for statistical analysis, such as by using sentence trees, and training algorithm 712 is configured to be compatible with such features in order to determine the progress of training.

The real-time ASRs in support system 108 are used in two different manners that call for different statistical processing. In the first manner, they are used to recognize a process once an IA has determined the corresponding business intent. For example, one or more IAs 105 may select "new reservation" as the business intent for a sentence spoken by a caller, and on that basis one or more real-time ASRs in support system 108 will attempt to recognize the specific words spoken by the caller.

In the second manner, a real-time ASR rather than an IA is used to determine the business intent. This is a different recognition task from determining the specific words spoken by the caller. For example, determining whether the business intent is likely to be "new reservation" or "seat request" may involve recognizing a small number of highly likely keywords specific to each intent, such as the words "from" and "to" for "new reservation" and the words "aisle" and "window" for "seat request". One type of ASR in support system 108 may be better suited to determining business intent, and another type may be better suited to recognizing words based on that business intent. In one embodiment, the format of the training statistics that training subsystem 710 provides for each real-time ASR is adjusted according to whether that ASR is to be optimized for intent determination or for word recognition based on a determined intent.

Part of the training process involves determining how effective the machine learning has been for the real-time ASRs in support system 108. This is called validation. In a preferred embodiment, validation is performed by training subsystem 710. In alternative embodiments, validation is performed by iRouter 101 or by a dedicated validation processor (not shown). In validation, the ASRs are run in parallel with one another and with the IAs to determine how their performance compares. Each training instance provides more information that is used to develop statistical models and probabilities of grammar usage for each business meaning provided by the IAs. In some situations, historical data from the IAs also determine the level of automation that can be expected for an utterance. If IAs routinely provide multiple meanings for an utterance, an ASR will be usable only if substantial context training is possible. ASRs with robust context processing may handle such utterances correctly, whereas ASRs that are contextually weak may be unable to meet a minimum threshold no matter how much training is provided. For example, the utterance "IP" could mean "Internet Protocol" or "Intellectual Property". When used in an application where both meanings are common, errors in processing accuracy are to be expected unless an ASR is able, after training, to derive which of the two meanings is appropriate.

As training proceeds, the performance of a real-time ASR improves. At the point where the ASR is statistically stable enough to meet the needs of its particular use within IVR system 100, it is placed into production operation. For example, an ASR intended to determine the business meaning of an utterance may operate in a non-production mode, in parallel with an IA, until it has been trained sufficiently for its performance to approach that of the IA. At that point it is switched to production operation, relieving the load on the IAs in processing 210, 211.

In a typical embodiment, in both real-time production processing and training processing, input from two IAs is provided to two ASRs to increase accuracy. If the input from two IAs for the same utterance in the same user discourse differs, in some embodiments the utterance is submitted to a third IA (in some instances selected based on a measure of IA quality) for determination of meaning.

When an ASR reaches a level of accuracy above a certain threshold, as determined through validation and based on the specifics of the environment, training processing transitions. In one exemplary environment, the ASR is used for production processing but training continues as described above. In a less demanding environment, or in one with fewer available resources, training ceases altogether. In a third environment, training continues but at a reduced priority (for example, training processing occurs only when a certain amount of processing capacity is available, or only when the performance of the ASR is found to have degraded to a certain degree).

In some embodiments, a validation processor is configured to test the ASRs to determine their performance levels. Validation follows a training phase in some embodiments and is performed concurrently with training in others. Based on the results of validation, iRouter 101 changes its allocation of utterances to ASRs and IAs. For example, if an ASR is found to perform sufficiently well in comparison with an IA in determining business meaning, iRouter 101 routes utterances to that ASR far more often than to the IA. Advantageously, such routing is highly adaptable and configurable. Following the example used in connection with FIGS. 3 to 5, iRouter 101 may, based on performance statistics, favor an IA for response interpretation immediately after the welcome message (FIG. 4B), favor a first ASR for response interpretation concerning movies or meals (FIG. 5A), and favor a second ASR for response interpretation concerning seat assignments, plane information and certain other choices shown in FIG. 5B. In some embodiments, two ASRs are selected for each specific area of interpretation (as in 210, 211) to ensure accuracy. If both ASRs provide the same interpretation, the corresponding response is provided to the user. If the ASRs differ, the utterance is provided to an IA, as in 217, so that a meaning is selected through adjudication.
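The two-ASR arrangement with adjudication by an IA can be sketched as follows. This is only an illustration of the control flow described above; the `Recognizer` type and the `interpret` function are hypothetical names, and the recognizers are modeled as plain callables that return an interpretation or `None`.

```python
from typing import Callable, Optional

# Hypothetical recognizer type: takes audio and returns a meaning, or None on failure.
Recognizer = Callable[[bytes], Optional[str]]


def interpret(
    audio: bytes,
    first_asr: Recognizer,
    second_asr: Recognizer,
    adjudicating_ia: Recognizer,
) -> Optional[str]:
    """Accept the ASR result when both ASRs agree; otherwise ask the IA to adjudicate."""
    first = first_asr(audio)
    second = second_asr(audio)
    if first is not None and first == second:
        return first
    return adjudicating_ia(audio)
```

In this sketch the IA is consulted only on disagreement or failure, so human effort is spent only where the automated recognizers cannot settle the meaning themselves.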

As a result, human IAs are needed only at the specific times when ASRs cannot perform adequately. Depending on business criteria, processing can return to the ASRs immediately after IA intervention, and the IA need not remain connected to the customer discourse. Where training can improve the ASRs, it does so without imposing significant additional cost or other overhead on the overall IVR system 100. Human interaction need not extend beyond listening to a single user utterance and selecting the user's meaning or intent from a drop-down list of predefined options, so that an appropriate automated response is provided to the user.

Referring to FIG. 8, an exemplary processing flow 800 for ASR training is shown. A digitized audio stream containing a user utterance is provided (801) to one or more IAs 105 and, if the IAs are able to provide a usable intent response as described in connection with FIG. 7, to training ASR 711. If training ASR 711 cannot recognize (802) the utterance well enough to convert the audio to its textual counterpart, the utterance is discarded and not used for training.

If ASR 711 can sufficiently recognize (802) the utterance, statistical models/tuned grammars (for example, grammars corresponding to the meaning and data provided by the IAs) are built (803), as described above in connection with FIG. 7. For some utterances that ASR 711 recognizes with a confidence below a certain threshold, an additional verification loop may be used in which an IA verifies the intent or data recognized by ASR 711. If the recognition is confirmed, processing proceeds as described for 803. Otherwise, the result is discarded.

Next, a test is made to determine (804) whether the performance of training ASR 711 is now sufficient. The performance threshold may depend on the criticality of the application. A healthcare application, for example, may be far less tolerant of errors than a free tourist information service would be. The performance threshold may also depend on the rate at which new words or phrases are added to the statistical model. If performance is not sufficient, processing returns to await further utterances that can be digitized (801) and used for additional training. If performance is sufficient, the results of training are applied to configure (805) the real-time ASRs of support system 108 with the models resulting from the training. These real-time ASRs are then validated and, if appropriate, used for production processing.

In some embodiments, training is then considered complete. The ASR is initially brought online in a provisional mode, that is, as a shadow to the IAs. If the ASR meets a level of quality as determined by business criteria (for example, by comparing results from the ASR with those of one or more IAs), it begins to be used fully in production, thereby replacing an IA in processing 210. Similarly, the performance of a second ASR is measured and, if it yields sufficient recognition quality, it is brought online to replace a second IA in processing 211. In other embodiments, a further test 806 is made at times determined by the particular environment to check whether the performance of the ASR has dropped below some applicable minimum threshold. If it has, flow returns to 801 for additional training. If performance is acceptable, processing loops back to 806 and the test is repeated at an appropriate time. If performance does not reach an acceptable threshold even after many attempts, in some embodiments training is abandoned.
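The main path of processing flow 800 (steps 801 through 805) can be summarized in a short sketch. It assumes, purely for illustration, that the statistical model is a table of word counts per business intent and that the sufficiency test is supplied by the caller; the names `train`, `Utterance` and `Model` are hypothetical and the verification loop and test 806 are omitted.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical types: an utterance is (audio, intent supplied by an IA or None).
Utterance = Tuple[bytes, Optional[str]]
Model = Dict[str, Dict[str, int]]  # business intent -> word -> count


def train(
    utterances: Iterable[Utterance],
    training_asr: Callable[[bytes], Optional[str]],
    is_sufficient: Callable[[Model], bool],
) -> Tuple[Model, bool]:
    """Build per-intent statistics until the performance test passes."""
    model: Model = defaultdict(lambda: defaultdict(int))
    for audio, intent in utterances:        # 801: utterance provided to the IA
        if intent is None:                  # the IA gave no usable intent response
            continue
        text = training_asr(audio)          # 802: attempt recognition
        if text is None:                    # not recognized: discard the utterance
            continue
        for word in text.lower().split():   # 803: update the statistics for this intent
            model[intent][word] += 1
        if is_sufficient(model):            # 804: performance test
            return model, True              # 805: caller configures the real-time ASRs
    return model, False
```

The boolean in the return value tells the caller whether the threshold was reached, so the caller can decide whether to configure the real-time ASRs or keep collecting utterances.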

FIG. 9 is a high-level block diagram illustrating an example of a computer 900 used as any of the computers/processors referred to herein. Illustrated is at least one processor 902 coupled to a chipset 904. Chipset 904 includes a memory controller hub 920 and an input/output (I/O) controller hub 922. A memory 906 and a graphics adapter 912 are coupled to memory controller hub 920, and a display device 918 is coupled to graphics adapter 912. A storage device 908, a keyboard 910, a pointing device 914 and a network adapter 916 are coupled to I/O controller hub 922. Other embodiments of computer 900 have different architectures. For example, in some embodiments memory 906 is coupled directly to processor 902. In some embodiments, components such as keyboard 910, graphics adapter 912, pointing device 914 and display device 918 are not used in certain computers 900 that do not require direct human interaction (for example, certain server computers).

Storage device 908 is a computer-readable storage medium such as a hard drive, CD-ROM, DVD or solid-state memory device. Memory 906 holds instructions and data used by processor 902. Pointing device 914 is a mouse, trackball or other type of pointing device, and is used in combination with keyboard 910 to input data into computer system 900. Graphics adapter 912 displays images and other information on display device 918. Network adapter 916 couples computer system 900 to the Internet 1001. Some embodiments of computer 900 have components different from and/or in addition to those shown in FIG. 9.

Computer 900 is adapted to execute computer program modules for providing the functionality described herein. As used herein, the term "module" refers to computer program instructions and other logic used to provide the specified functionality. A module can thus be implemented in hardware, firmware and/or software. In one embodiment, program modules formed of executable computer program instructions are stored on storage device 908, loaded into memory 906 and executed by processor 902.

The types of computers 900 used by the components described herein vary depending on the embodiment and on the processing power used by the entity. For example, a customer's computer 103 typically has limited processing power. iRouter 101, in contrast, may comprise multiple servers working together to provide the functionality described herein. In some applications, a single processor (or set of processors) may implement both the real-time ASRs in support system 108 and training ASR 711 and the other functions of training subsystem 710. In those applications, deciding how much training to perform and when to perform it allows a relatively inexpensive and modestly powerful computer to be used for both training and production ASR processing.

The systems and methods described above are applicable not only to voice interactions but, in certain embodiments, are also usable with, for example, video, text, email, chat, photographs and other images. These other embodiments are usable in applications such as online chat, security surveillance, theme park concierge services and device help. As a specific example, a help facility in which open-ended questions are interpreted and processed as described above can be provided on consumer devices such as the iPhone (registered trademark) and iPad devices offered by Apple, Inc. Similarly, the techniques described above can be used to facilitate recognition of video streams and images.

As is evident from the discussion above, an ASR subsystem is sometimes more appropriate than an HSR subsystem for handling a portion of a customer interaction. To provide the best possible user experience, benefit can be gained by optimizing the selection of the resources used for recognition when an application program (such as one stored in workflow repository 106) calls for speech recognition resources. That selection covers both the choice between ASR and HSR and the choice of the particular ASR/HSR resource best suited to the current recognition task.

Referring to FIG. 11, a block diagram of the operation of an ASR proxy 1102 for achieving such selection of appropriate processing resources is shown. More specifically, the functions described below are implemented, in various embodiments, by one or more of: encapsulation in Media Resource Control Protocol (MRCP) within a Voice Extensible Markup Language (VXML) browser, web services, and application programming interfaces (APIs, for example written in the Java or C# languages). In one specific embodiment, common ASRs from various vendors use MRCP as a standard interface to the VXML platform (browser). In this environment, ASR proxy 1102 is configured to appear to be an ASR engine to a software application 1101 running with the VXML platform. It instead acts as a proxy between the VXML application and the speech recognition functions by providing speech recognition resources from both ASR and HSR subsystems.

As described in more detail below, ASR proxy 1102 is configured to select freely among one or more ASR subsystems 1104 (such as those described above in connection with the discussion of support system 108) and HSR subsystems 1106 (such as those described above in connection with the discussion of off-site agents 105). Based on a statistics database subsystem 1105, ASR proxy 1102 communicates with a recognition decision engine 1103 (the operation of which is further described in connection with FIG. 12) and a result decision engine 1107 (the operation of which is further described in connection with FIGS. 13 to 16) to decide which ASR/HSR resources 1104, 1106 to use at any particular time. If any HSR resources are selected for use, corresponding user interface information is provided to the appropriate HSR desktop workstation 1108, as described above in connection with off-site agents 105.

ASR proxy 1102 relieves the developer of software application 1101 of the need to consider whether an utterance should be recognized by ASR or by HSR. Such software developers can therefore build voice user interfaces that are more human-like than those conventionally used with computers, and can assume that such interfaces are available.

Referring to FIG. 11 in more detail, software application 1101 serves different purposes in different embodiments. In one embodiment, software application 1101 is an IVR system for toll-free caller assistance. In another, it is an interactive help application on a tablet computer. Software application 1101 directs ASR proxy 1102 by telling it what to recognize (that is, by providing it with a grammar) and by providing it with an utterance, typically an audio file such as a .wav file or a real-time audio stream (for example, an MRCP Real-Time Protocol stream). As would be expected, ASR proxy 1102 responds with the "text" or meaning of what it recognized, together with a confidence score indicating the ASR's confidence that it recognized the utterance correctly.

Because ASR proxy 1102 may have capabilities that differ from those of a conventional ASR, it may require additional information, for example in grammar meta tags, for statistics and decisions. This additional information includes a unique way to identify the prompt and grammar, a unique way to identify the current session, a unique way to identify the "voice" or user (so that learning of the speaker's acoustic model can continue), and thresholds that specify the behavior of ASR proxy 1102. In some applications, grammars are predefined or built in. In other applications, grammars are not built in, so meta-information relating to the grammar (such as user interface information to frame and guide an agent's decision) is provided to better define the possible responses (for example, for an HSR subsystem).

When software application 1101 requests that ASR proxy 1102 recognize an utterance, ASR proxy 1102 passes processing to recognition decision engine 1103, which is responsible for deciding how the utterance is to be recognized. For example, parameters and confidence thresholds provided by software application 1101 may influence this decision. As a specific example, if an application calls for extremely high recognition quality, recognition decision engine 1103 may direct that recognition be accomplished solely by HSR resources 1106. On the other hand, an application may regard cost as paramount and therefore direct that only ASR resources 1104 be used by default, reserving HSR resources 1106 for cases in which use of ASR produces many errors.

In one embodiment, recognition decision engine 1103 makes similar decisions automatically and dynamically, varying the appropriate thresholds to meet the particular requirements of the application. A high quality threshold may thus be used for high-asset bank customers, while utility bill payment inquiries from consumers are given a lower acceptable threshold. In this embodiment, the thresholds are based on historical statistics computed from past recognition attempts.

Beneficial results have been found to arise not only from choosing between the use of ASR resources and the use of HSR resources, but also from allowing combinations of such resources to be chosen. For example, one set of parameters may be best met by submitting an utterance for recognition by multiple ASR resources, another by submitting the utterance to a single specific ASR, and yet another by submitting the utterance to a mix of ASR and HSR resources. In practice, matters such as the extent to which an ASR has been trained or tuned (for example, as discussed above in connection with training), whether an ASR has been validated for a particular grammar, whether the cost of multiple recognition passes is acceptable, and historical results all help in deciding which resources to apply in any particular situation.

Similarly, security meta tags relating to an utterance also help in determining the most appropriate recognition resource. For example, an utterance carrying a meta tag indicating that it is a social security number can be routed for processing by an ASR resource, avoiding the possibility of a human obtaining personal information about an individual.

Another parameter considered in some embodiments is the level of activity of the various system resources. If human staff are backlogged with a heavy volume of requests, that backlog is usable as a parameter to favor increased use of ASR resources.
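The considerations above (quality threshold, historical ASR results, security meta tags and HSR backlog) can be combined in a small routing sketch. The field names, the `Route` values, the default backlog limit and the order of the rules are all assumptions made for illustration; the specification leaves the actual decision logic of recognition decision engine 1103 to FIG. 12.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ASR_ONLY = "asr_only"
    ASR_THEN_HSR = "asr_then_hsr"  # try ASR first, fall back to HSR on low confidence
    HSR_ONLY = "hsr_only"


@dataclass
class RecognitionRequest:
    required_accuracy: float      # quality threshold demanded by the application
    asr_accuracy: float           # historical ASR accuracy for this grammar
    sensitive: bool = False       # security meta tag, e.g. a social security number
    hsr_backlog: int = 0          # requests currently waiting for human agents
    backlog_limit: int = 50       # hypothetical limit above which ASR is favored


def choose_route(request: RecognitionRequest) -> Route:
    """Pick recognition resources from the request parameters (illustrative rules)."""
    if request.sensitive:
        return Route.ASR_ONLY                 # keep personal data away from humans
    if request.asr_accuracy >= request.required_accuracy:
        return Route.ASR_ONLY                 # ASR has historically been good enough
    if request.hsr_backlog > request.backlog_limit:
        return Route.ASR_THEN_HSR             # humans are busy: use them sparingly
    return Route.HSR_ONLY
```

For instance, a request with `required_accuracy=0.99` and `asr_accuracy=0.90` would be routed to HSR under these rules, unless the utterance is tagged as sensitive or the human backlog exceeds the limit.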

In some embodiments, multiple resources, whether of the same type or of different types, are used to provide a double check of results.

In yet another embodiment, the recognition decision engine 1103 dynamically tracks the length of the current audio stream and compares it with the expected utterance length defined by the corresponding grammar. For example, if the utterance is expected to match a grammar consisting of just one of three colors, "red", "green", or "blue", and the actual utterance is three seconds long, an earlier decision to have an ASR resource recognize the utterance can be changed so that an HSR resource recognizes it in addition to or instead of the ASR, on the expectation that the utterance is not one of the single-syllable colors in the grammar. Such an approach has been found to minimize the final time needed to recognize "unexpected" utterances and therefore to increase the overall efficiency of the ASR proxy 1102.
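A minimal sketch of this length check is shown below. The route labels, the function name, and the idea of a single per-grammar maximum duration are assumptions made for illustration; the text does not say how the expected length is represented.

```python
def reroute_on_length(current_route, audio_seconds, expected_max_seconds):
    """Add HSR when the audio runs longer than the grammar would predict.

    expected_max_seconds is a hypothetical per-grammar value, for example
    about one second for a grammar of single-syllable color names.
    """
    if current_route == "ASR" and audio_seconds > expected_max_seconds:
        # The utterance is probably out of grammar, so involve a human
        # recognizer now instead of waiting for the ASR to fail.
        return "ASR+HSR"
    return current_route
```

With a one-second expectation for the color grammar, a three-second utterance would be rerouted while a 0.6-second utterance would stay with the ASR.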

As described above, the operation of the ASR proxy 1102 and the corresponding engines 1103 and 1107 makes extensive use of statistics, thresholds, and other specific information for personalizing the system to meet the needs of the software application 1101. This information is stored in the statistics database 1105, as shown in FIG. 11. For example, the results of an ASR's operation are stored in the database 1105 as confidence score statistics, and the aggregate statistics for that ASR are considered in determining whether the ASR is usable under the applicable business rules or other rules required by the software application 1101. In addition, any statistics relating to an utterance, such as speaker, prompt, grammar, application, recognition method (for example, ASR, HSR, single ASR, multiple ASR, multiple HSR), confidence, no-match or no-input, and training/tuning, are stored in the database 1105 by the ASR proxy 1102.

In the same manner as described with respect to the previous figures, if an ASR fails to provide a usable result for an utterance, the utterance is sent to an HSR resource for recognition or disagreement resolution. Statistics are maintained for HSRs as well as ASRs, and they are also maintained on a per-speaker basis. Thus, if an ASR is found to be particularly effective at recognizing a particular speaker, the statistics are maintained and updated so as to increase the likelihood that this ASR is used for later utterances from the same speaker. Likewise, statistics are maintained on a per-grammar basis, again maximizing the likelihood that the recognition decision engine chooses the appropriate resource based on the expected grammar or prompt/grammar combination. For example, a "yes/no" grammar would be more effective for ASR recognition of a simple prompt such as "Are you John Smith?" but less effective for a more complex question such as "Are you feeling better today than on the same day last week?"

Generalizing from the above, statistics are generated on a variety of bases and maintained so that intelligent decisions are made about when to use a particular ASR/HSR resource. Based on confidence levels, grammars that allow high-confidence ASR recognition can even be used more often by the software application 1101. For example, a "yes" or "no" grammar would have very high confidence with a simple ASR resource. Statistics are recorded for prompt/grammar combinations ranging from simple confirmation statements such as "We have your phone number as (555) 123-4567. Is that correct?" to more complex communications such as "If you have been feeling well over the past week, say 'yes'. If you have not been feeling well at all, say 'no'."

The discussion of grammars herein can be extended and generalized to combinations of grammars and prompts. For example, one statistic relates to the overall confidence for a set of utterances by the current speaker in the current session (that is, across multiple prompts). If ASR recognition is failing for a speaker regardless of the prompt/grammar combination, this indicates that the ASR proxy 1102 would do better to rely on HSR for this speaker rather than attempt ASR at all. On the other hand, if a particular speaker's utterances routinely show strong confidence, the ASR proxy uses ASR as the preferred recognition method. To generalize beyond a particular session, a unique speaker reference ID allows the system to recognize a particular speaker (for example, based on the telephone number used to connect to the system) and to choose an appropriate ASR or HSR resource.

The software application 1101 provides thresholds that the software developer considers appropriate for a particular situation and that, in some situations, are generated over time based on previous recognition experience. For example, where statistics can be generated through double checking or confirmation by HSR resources, those statistics are collected and stored in the database 1105. Mean, standard deviation, and mode information from such statistics is applied to the various thresholds according to needs determined by the software developer of the software application 1101, based on the overall goals of the application.

Statistics can also be used to determine when further reliance on an ASR resource is no longer effective. For example, if the recognition quality over a substantial sample size for an ASR and a particular grammar indicates that performance is unlikely to rise above an acceptable recognition threshold, that ASR is removed from future consideration for that particular recognition task. The recognition task may need more training (or tuning), but once multiple training/tuning attempts have proven unsuccessful, that particular recognition attempt is removed from consideration permanently, until a change occurs such as an adjustment to the prompt/grammar or the use of a new ASR or a new ASR version.
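The exclusion rule can be sketched as a simple test on accumulated counts. The minimum sample size and the acceptable accuracy below are hypothetical values, and comparing the observed success rate with the threshold is a simplification of "unlikely to rise above an acceptable recognition threshold".

```python
def should_exclude(correct, total, min_samples=200, acceptable=0.90):
    """Decide whether an ASR should be dropped for one grammar.

    correct / total are counts of validated recognitions for this
    ASR and grammar; min_samples and acceptable are assumed values.
    """
    if total < min_samples:
        # Not enough evidence yet to exclude the ASR.
        return False
    return correct / total < acceptable
```

A production rule would probably also count failed training/tuning attempts and reset the decision when the prompt, grammar, or ASR version changes, as the paragraph above notes.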

Statistics can also be used to tune an ASR. Grammar tuning may be purely statistical, such as the percentage of the time "red" is used in the grammar "red, green, or blue", or it may involve synonyms, such as "turquoise" for "blue". In the latter case, tuning is facilitated by using HSR resources as an "out-of-grammar" recognizer (for example, to confirm that "turquoise" should be considered a synonym for "blue" in a particular case). Immediately after such tuning, it may be desirable in some applications to introduce the tuned ASR on a "silent" limited-test basis rather than a production basis, to ensure that performance is above acceptable thresholds. In one embodiment, HSR is used to verify that the ASR can recognize the grammar in question, to calculate confidence threshold statistics during the validation described above, and to calculate confidence threshold statistics when ASR recognition is invalid. Even after validation, random double checks by ASR or HSR resources provide an ongoing check on the validity of the selected recognition method. The frequency of such checks is, in one embodiment, based on the statistical deviation between correct and incorrect ASR recognitions. As a specific example, consider a situation in which the mean confidence of correct recognitions is 56 and the mean confidence of incorrect recognitions is 36. If the standard deviation is small (for example, 8), this suggests there is little practical confusion between correct and incorrect recognitions, so double checks need not be used very often. If the standard deviation is larger (for example, 12), however, more frequent double checks are needed to tune the grammar confidence threshold more finely.
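The numeric example above can be worked through by expressing the gap between the two mean confidences in units of standard deviation. The separation measure and, in particular, the mapping from separation to a double-check fraction are assumptions introduced for illustration; the text only states that a larger deviation calls for more frequent checks.

```python
def separation(mean_correct, mean_incorrect, std_dev):
    """Gap between mean confidences, in standard deviations."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (mean_correct - mean_incorrect) / std_dev


def double_check_fraction(sep, scale=0.25):
    """Hypothetical rule: check more often as the separation shrinks."""
    if sep <= 0:
        return 1.0  # distributions overlap completely: always check
    return min(1.0, scale / sep)


tight = separation(56, 36, 8)    # 2.5 standard deviations apart
loose = separation(56, 36, 12)   # about 1.67 standard deviations apart
```

Under this assumed rule, the grammar with standard deviation 12 would be double checked more often than the one with standard deviation 8, which matches the direction described in the text.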

Over time, the statistics can suggest that the ASR proxy 1102 change its initial behavior. For example, if the statistics indicate very good success, they may suggest changing from a double check by two ASRs to a check by only one ASR. Alternatively, if success is poor, they may suggest abandoning attempts to train or tune for a particularly difficult grammar and using HSR only.

Initial training and subsequent tuning of an ASR share common characteristics and may be implemented similarly. In many cases, however, training involves subtler issues, larger vocabularies, and statistical language models than initial tuning does, so techniques that work well for tuning may not be optimal for training. Training may require considerably larger sample sizes, greater use of HSR, and reliance on out-of-grammar ASR resources.

Particularly complex grammars may require consistent double checking by two ASRs with different recognition models (from different vendors), with differing results adjudicated by an HSR. Relying on multiple HSRs (for example, two HSRs and a third that acts to resolve differences) can provide further benefit in some cases (see, for example, Patent Document 5, the contents of which are incorporated by reference as if fully set forth herein). The ASR proxy 1102 can be configured, via the software application 1101, to handle any of these possibilities.

Turning to FIG. 12, in one embodiment a recognition decision engine 1201 operates as follows to decide how to process an utterance, depending on historical statistics (for example, for the speaker, session, grammar, and application) and other factors, and based on various configuration settings. In the example shown in FIG. 12, as an initial step the recognition decision engine 1201 can direct that an ASR not be used until it has been trained or tuned. A check 1202 is made to determine this. If so directed, a check 1207 is made to determine whether such tuning/training has already been completed. If not so directed, a quick check 1203 is made to determine whether the grammar is simple enough that training is not needed (for example, the grammar has only a very small number of terminals). If the grammar is not simple, processing again passes to check 1207. If the grammar is simple enough, processing passes to check 1204. Check 1207 examines the stored statistics on ASR success for this grammar and whether the ASR has previously been tuned/trained for this grammar (whether in the same application 1101 or in other applications that may have similar goals and corresponding confidence thresholds). If these statistics indicate sufficient training/tuning, check 1207 passes processing to check 1204. Otherwise, processing proceeds to HSR processing 1210.

Check 1204 uses confidence statistics stored in the database 1105, together with a threshold for the ASR's ability to understand the particular grammar and a second statistic on the ongoing confidence in recognizing the speaker within the session. For simple grammars that have not been tuned or trained, ongoing statistics on how well the ASR is performing the recognition task are compared with the expected recognition confidence threshold supplied by the application or with a threshold calculated by the proxy. When the first recognition is being performed, the threshold may be set so that it is automatically deemed not to be met, forcing recognition by HSR so that the proxy can calculate the threshold initially. In some embodiments, the threshold is augmented by historical information about the current grammar. In addition, if the ASR's ability to recognize the speaker suggests confidence above the threshold, ASR processing is used and processing passes to check 1205. Otherwise, HSR processing 1210 is used. For example, the threshold can be set as the number of times ASR recognition falls below the confidence level (or an adjusted confidence level, for example for a high-value speaker). In some applications this number is set low, so that even a single ASR recognition below the confidence level causes subsequent recognitions to be performed by HSR.
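The count-based threshold mentioned at the end of this paragraph can be sketched as a small per-session gate. The class and attribute names are hypothetical, and the sketch covers only the count of below-confidence recognitions, not the grammar statistics that check 1204 also consults.

```python
class SessionAsrGate:
    """Tracks below-confidence ASR results for one speaker session."""

    def __init__(self, confidence_threshold, max_low_confidence=1):
        self.confidence_threshold = confidence_threshold
        # Number of low-confidence results that switches the session to HSR.
        self.max_low_confidence = max_low_confidence
        self.low_count = 0

    def record(self, confidence):
        """Record the confidence of one completed ASR recognition."""
        if confidence < self.confidence_threshold:
            self.low_count += 1

    def use_asr(self):
        """True while the session has not reached the low-confidence limit."""
        return self.low_count < self.max_low_confidence
```

With `max_low_confidence=1`, a single recognition below the confidence level sends all later recognitions in the session to HSR, which corresponds to the "set low" case described above.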

Check 1205 determines whether the software application 1101 or another component (for example, a requirement for training or validation) requires a double check for the recognition. If no double check is required, processing passes to step 1206 and a single ASR is used for recognition.

If a double check is required, processing passes to check 1208 to determine whether the double check can be performed by two or more ASRs (for example, because two or more trained and otherwise acceptable ASRs are available). If it can, processing passes to step 1209 and recognition is performed by those multiple ASRs. If it cannot, for example because an ASR is not suitable for the recognition or because ASR validation is to be performed, processing passes to steps 1210 and 1211, so that recognition is performed by both ASR and HSR resources.
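The flow of FIG. 12 described above (checks 1202 through 1208) can be summarized as one decision function. This is a sketch under assumptions: each check is reduced to a precomputed boolean, and the `Context` fields and returned labels are illustrative names rather than anything defined in the specification.

```python
from dataclasses import dataclass


@dataclass
class Context:
    require_trained_asr: bool      # outcome of check 1202
    grammar_is_simple: bool        # outcome of check 1203
    asr_trained_for_grammar: bool  # outcome of check 1207
    confidence_ok: bool            # outcome of check 1204
    double_check_required: bool    # outcome of check 1205
    usable_asr_count: int          # input to check 1208


def decide(ctx):
    """Return the recognition path selected by the FIG. 12 flow."""
    # Checks 1202 and 1203: check 1207 is reached either when training
    # is required outright or when the grammar is not simple.
    needs_training_check = ctx.require_trained_asr or not ctx.grammar_is_simple
    if needs_training_check and not ctx.asr_trained_for_grammar:
        return "HSR (1210)"
    if not ctx.confidence_ok:                 # check 1204
        return "HSR (1210)"
    if not ctx.double_check_required:         # check 1205
        return "single ASR (1206)"
    if ctx.usable_asr_count >= 2:             # check 1208
        return "multiple ASR (1209)"
    return "ASR + HSR (1210, 1211)"
```

In practice each boolean would be derived from the statistics database 1105 and the application's configuration rather than supplied directly.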

When an ASR or HSR completes recognition, statistics about the recognition are stored in the statistics database 1105.

As described above with respect to FIG. 11, the ASR proxy 1102 also communicates with the result determination engine 1107. The purpose of such an engine is to evaluate the results of the recognition process performed by the ASR/HSR resources. Referring to FIG. 13, an example result determination engine 1301 is shown, whose operation is described as follows. The result determination engine 1301 examines the recognition results from one or more ASR/HSR resources and determines the appropriate next step. First, a check 1302 is made to determine whether the reported confidence level meets the recognition threshold set by the software application 1101 or calculated by the ASR proxy 1102. If it does, validation statistics are updated to reflect the successful recognition (1303), and operation of the result determination engine 1301 is complete. If it does not, further processing is needed, so a "filler" prompt is provided to the user (1304). For example, the caller may be told, "Please hold on, we are still working on that." The particular message provided to the caller may be such a default message or a more specific message provided and determined by the software application 1101 through some form of reference.

Processing then passes to recognition 1305 by one or more HSR resources and then to a check 1306 to determine whether the HSR recognition agrees with the ASR recognition. If it does, the statistics are again updated (1303), but this time they are pro-rated because the recognition also required HSR. In one embodiment, the pro-rating is a one-third deduction from the score that would have been given had the confidence threshold been cleared.
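The pro-rated update can be illustrated with exact fractions. The notion of a numeric "credit" per successful recognition is an assumption made for this sketch; the text specifies only the one-third deduction.

```python
from fractions import Fraction

# One-third deduction described for one embodiment.
PRO_RATE_DEDUCTION = Fraction(1, 3)


def success_credit(full_credit, needed_hsr):
    """Credit added to an ASR's validation statistics for a correct result.

    full_credit is the (hypothetical) amount awarded when the ASR
    clears the confidence threshold on its own.
    """
    if needed_hsr:
        # Correct, but only confirmed with human help: deduct one third.
        return full_credit * (1 - PRO_RATE_DEDUCTION)
    return full_credit
```

For a full credit of 3, an ASR result confirmed by HSR would earn 2, while an unassisted success would earn the full 3.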

If the HSR and ASR recognition results differ, a check 1308 is made to determine whether dual HSR was used. If it was, the results from the dual HSR are used (1307) and the statistics tracking successful ASR recognition are decremented. If it was not, an additional filler message is played (1309) and an additional HSR recognition is attempted (1310). If the HSR results do not agree, a third attempt using HSR is made (this is done in some embodiments but not in others). If there is no agreement among the HSRs, a "no match" result is returned, indicating that none of the recognizers understood the speaker (and therefore no bias toward the ASR is indicated). Depending on current load conditions, performing a second or third HSR may not be practical, in which case the single HSR result is used, again with no bias toward the ASR. In such embodiments, similar processing is used for the operation of the result determination engines discussed with respect to FIGS. 14, 15, and 16. If the ASR is determined to agree with the HSR recognition, processing is complete. Otherwise, processing returns to 1307 to apply the HSR recognition and update the statistics as described above.
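The escalation among HSR results described above (second opinion, optional third attempt, then "no match") can be sketched as follows. The function name, the tuple return values, and the `max_attempts` parameter are assumptions; `max_attempts=1` models the load-limited case in which a single HSR result is accepted.

```python
from collections import Counter


def resolve_hsr(results, max_attempts=3):
    """Decide what to do with the HSR results gathered so far.

    results: non-empty list of HSR transcriptions, in the order obtained.
    Returns ("use", text), ("need_more", None), or ("no_match", None).
    """
    if not results:
        raise ValueError("at least one HSR result is required")
    text, votes = Counter(results).most_common(1)[0]
    if votes >= 2 or max_attempts == 1:
        # Two human recognizers agree, or load limits us to one HSR.
        return ("use", text)
    if len(results) < max_attempts:
        # No agreement yet: request another HSR recognition.
        return ("need_more", None)
    return ("no_match", None)
```

For example, `resolve_hsr(["blue", "green"])` asks for a third opinion, and `resolve_hsr(["blue", "green", "blue"])` then settles on "blue".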

Note that in one implementation the ASR need not select from the grammar as the result of recognition. The ASR can also return a "no match", "no input", or "noise" result, in which case further HSR processing is used as described above, again depending on criteria established by the application.

Referring to FIG. 14, one embodiment of a result determination engine 1401 is shown, whose operation is described as follows. The result determination engine 1401 examines recognition results from two or more ASR resources and determines the appropriate next step. First, a check 1402 is made to determine whether the results from the two ASR resources agree. If they do, a check 1403 is made to determine whether the confidence is above the appropriate threshold. In one embodiment, each ASR has its own threshold, and confidence is considered sufficient if any ASR is above its confidence threshold. In that case, validation statistics are incremented (1404) for the recognizers that are above threshold (statistics for any ASR that agrees but is below threshold are neither incremented nor decremented), and processing is complete.

If the results do not agree, or if the confidence level is not high enough, filler is played to the caller (1405), and at 1406 an HSR resource is called on to perform recognition. A check 1407 is then made to determine whether at least one of the ASR results agrees with the HSR result. If not, a check 1408 is made to determine whether the HSR was a double-check HSR. If it was not, filler is played again (1409) and an additional HSR recognition 1410 is performed. If the HSR agrees with an ASR, if the HSR was a double check, or once the second HSR 1410 has been performed, processing moves on to use the agreeing HSR result (1411). This includes decrementing the statistics for ASRs that disagree, and also decrementing the statistics for any ASRs that agree but are below threshold (though by a pro-rated amount, one third in one embodiment). Next, validation statistics for any agreeing ASRs above threshold are incremented (1412), and processing is complete.
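The two stages of FIG. 14 can be sketched as a test for whether HSR is needed (checks 1402 and 1403) and a statistics update once an HSR result is available (1411 and 1412). The data class, the function names, and the unit step of 1 per update are assumptions for illustration; only the one-third pro-rating comes from the text.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class AsrResult:
    name: str
    text: str
    confidence: float
    threshold: float  # each ASR has its own confidence threshold


def needs_hsr(asr_results):
    """Checks 1402/1403: HSR is needed unless all ASRs agree and
    at least one of them is at or above its own threshold."""
    all_agree = len({r.text for r in asr_results}) == 1
    any_confident = any(r.confidence >= r.threshold for r in asr_results)
    return not (all_agree and any_confident)


def stat_deltas(asr_results, hsr_text, pro_rate=Fraction(1, 3)):
    """Steps 1411/1412: per-ASR statistic changes once HSR has decided."""
    deltas = {}
    for r in asr_results:
        if r.text != hsr_text:
            deltas[r.name] = Fraction(-1)      # disagreed with HSR
        elif r.confidence < r.threshold:
            deltas[r.name] = -pro_rate         # agreed, but below threshold
        else:
            deltas[r.name] = Fraction(1)       # agreed and above threshold
    return deltas
```

The update in `stat_deltas` follows the paragraph immediately above, in which an agreeing but below-threshold ASR is decremented by the pro-rated amount.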

FIG. 15 shows the processing of a result determination engine when one or more ASR resources are used together with one or more HSR resources. Operation of the particular result determination engine 1501 in this case begins with a check (1502) of whether all the results agree. If they do, a check 1503 is made, as above, to determine whether the confidence for each ASR is above its threshold, and if so the validation statistics are incremented (1504). As described above, for any ASRs that agree but are below threshold, the increment is reduced by the pro-rated deduction. Processing is then complete.

If the results do not agree, a check 1505 is made to determine whether a double-check HSR was used; if not, filler is played (1506) and a second HSR recognition 1507 is performed. Then, assuming the HSR results agree, the HSR result is used (1508) and the statistics for the disagreeing ASRs are decremented, as described above. If the HSR results do not agree, processing continues as described above with respect to FIG. 13. For any agreeing ASRs, validation statistics are incremented (1509), either fully or on a pro-rated basis as described above. Processing is then complete.

Referring to FIG. 16, the processing of one embodiment of a result determination engine 1601 is shown for the case where only HSR resources are used. An initial check 1602 determines whether a double-check HSR was used (assuming the calling application required one). If a double check was not used, filler is played (1603) and a second HSR recognition 1604 is performed to ensure that the recognition is correct.

A check 1605 is then made to determine whether the HSR results agree. If they do not, processing is complete, and in one embodiment further processing outside the scope of this process, such as a third HSR recognition (not shown), is needed to meet the requirements of the calling application. In such a case, if there is no convergence after the third recognition, a "no match" condition is declared, indicating that the recognition attempt failed. If there is convergence, the results of the at least two agreeing HSRs are used.

If the two HSR results agree at check 1605, processing is complete and, for example, the recognized utterance can be added to a group for tuning/training as described above. Interpreting a response to a prompt can be viewed as two kinds of text analysis: information extraction and sense classification. Information extraction is the identification, extraction, and normalization of the specific pieces of information essential to filling the slots of a business form, such as customer ID, telephone number, date and time, address, product type, and problem. Sense classification is concerned with identifying two additional types of information: meaning (intent) and response quality. Meaning (intent) concerns what kind of form needs to be filled out (billing, scheduling an appointment, a complaint, and so on). Response quality concerns the response itself (unintelligible, noise, Spanish rather than English, a desire to speak with a live agent, and so on).

Referring to FIG. 17, the methods and systems described above can be implemented to make the experience as human-like as possible. As shown by the results of predictive optimization 1730 and media acceleration 1734, the overall recognition gap before the ASR proxy responds to the application can be reduced to 1.25 seconds in one example. Looking at the specific graphs of FIG. 17 in detail, 1710 represents a typical recognition experience without optimization. The media (utterance) to be recognized is 3.75 seconds long (1750). In this case the ASR proxy streams the media in real time, but automatic recognition typically completes a fraction of a second after the end of the media stream (1712). The result determination engine of the ASR proxy determines that HSR (1860 in FIG. 18, described below) is required, but the media (utterance) must then be processed from the beginning, which adds about another 4 seconds (1714), so the gap seen by the user is at least 4.25 seconds. The application 1810 can fill this gap with what the industry often calls a "filler prompt," which lets the user know that the system is still working on the problem. Such a filler prompt clearly does not serve the goal of a more human-like interaction with the caller. Moving to graph 1715, the system can improve matters by accelerating the media, for example by 1 second, which shortens the human-assisted understanding to 3 seconds (1719) and shortens the gap between the end of the media and recognition to 3.25 seconds. This is a respectable improvement. As shown at 1730, automatic recognition uses a partial-recognition predictor to provide a prediction of the result in less time. As shown at 1732, it takes only 2 seconds to determine that recognition has failed, after which the ASR proxy streams the media for human assistance and accelerates it. As a result, the overall recognition gap from the end of the media to successful human assistance is reduced substantially, from 4.25 seconds to 1.25 seconds. This brings the recognition gap of the ASR proxy into a range that more closely matches human-like interaction.
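The timings quoted for FIG. 17 can be checked with a short calculation. The following Python sketch is illustrative only: the `Scenario` class and its field names are not part of the disclosure, and the 0.25-second ASR tail is an assumed value for "a fraction of a second," chosen because it reproduces the 4.25-second and 3.25-second gaps stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Timings in seconds, measured from the start of the utterance."""
    media_len: float     # length of the spoken media (1750)
    asr_known_at: float  # moment the ASR outcome (here: failure) is known
    hsr_time: float      # time HSR needs to work through the media

    def recognition_gap(self) -> float:
        # HSR starts when the ASR failure is known; the gap the user
        # notices runs from the end of the media to the HSR result.
        return self.asr_known_at + self.hsr_time - self.media_len


MEDIA = 3.75     # seconds, from the text
ASR_TAIL = 0.25  # ASSUMPTION: "a fraction of a second" after the media ends

scenarios = {
    "1710 no optimization": Scenario(MEDIA, MEDIA + ASR_TAIL, 4.0),
    "1715 media acceleration": Scenario(MEDIA, MEDIA + ASR_TAIL, 3.0),
    "1730 prediction + acceleration": Scenario(MEDIA, 2.0, 3.0),
}

for name, s in scenarios.items():
    print(f"{name}: {s.recognition_gap():.2f} s")
```

Under these assumptions the three gaps come out to 4.25, 3.25, and 1.25 seconds, which agrees with the values given for graphs 1710, 1715, and 1730.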

FIG. 18 shows the main system components of the ASR proxy and elaborates on several elements of FIG. 11 to further illustrate the ASR proxy. Although it does not appear in the illustration of FIG. 11, this disclosure includes a user state management store 1813, which is shown explicitly in FIG. 18 for clarity. User state management 1813 holds information about the user (for example, user identity, preferred communication channels, and owned devices). Information that matters for handling the user, such as recognition success (automated rather than human-assisted), is stored in the statistics store 1830 for future use. The system maintains information about the status of each interaction. This information consists, on the one hand, of information about the availability of intent analysis and, on the other hand, of information about the sequence of recognition requests presented, the responses to those requests, the meaning (intent) of those responses, the specific content extracted from those responses, and the action the proxy will perform next.

The proxy processing system adjusts its actions (that is, what additional information to request from the user and what action to perform next with that information) based on the specific prompt, the meaning (intent) of the response to that prompt, and the specific information extracted from the response. The system status subsystem 1815 continuously tracks HSR capacity or, in some embodiments, system load, and how this affects the use of automatic and human recognition. The remaining elements of FIG. 18 are as described above with respect to the other figures; here, ASR/NLU 1850 is drawn as multiple circles to represent the multiple ASR/NLU instances that are available.

FIG. 19 shows the operation of the decision engines, which optionally use ASR or DTMF capabilities (where appropriate for the application) based on an evaluation of system status. These operations are described herein as being handled by a recognition decision engine 1980 and a result determination engine 1990, but those skilled in the art will recognize that such engines can be implemented using a variety of memory and processor architectures. If there are no recognition statistics (1900), no automation is used, other than notifying the application to automate using a DTMF approach when there is insufficient HSR capacity. DTMF is made available to the application, and the application allows a DTMF variant to be made available according to business rules. In this embodiment, DTMF is used on the basis of a second recognition request from the application. In various embodiments, the application may instead choose to ignore the availability and attempt subsequent recognition, or to use DTMF for some recognition requests and leave the most difficult items to HSR. For example, telephone numbers can readily be collected by DTMF, whereas email addresses are better handled by HSR.

In one embodiment, the application is notified (1900R) according to system status 1815 and the availability of statistics 1830, and provides one of several forms of interaction: (1) human-like interaction 1925 using only human-assisted understanding; (2) human-like interaction 1930 using a combination of automation and human assistance at high quality; (3) human-like interaction 1930 using a combination of automation and human assistance at a quality that varies with load factors, without requiring the application to respond to the differing quality; (4) human-like interaction 1930 or 1940 using a combination of automation 1950 and human assistance 1960 at a quality that varies with load factors, in which the application increases verification prompting to compensate for lower automation confidence; and (5) interaction 1940 that is not intended to be human-like, such as a DTMF dialog. In this way, the system presents different types of prompts in response to the capabilities of the ASR proxy and the system load. For example, in case (5) the prompt would be "Press 1 for sales, press 2 for ...", whereas in case (1) the same question would be rephrased as "How may I help you?"
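The choice among the five interaction forms can be pictured as a small decision function. The sketch below is one possible reading, not the disclosed logic: the function name, its parameters, and the 0.7 load threshold are hypothetical, and the text does not state exact rules for choosing among forms (2) through (4). Only the no-statistics branch (human assistance, or DTMF when HSR capacity is insufficient) follows directly from the description of FIG. 19.

```python
from enum import Enum


class Interaction(Enum):
    HUMAN_ONLY = 1                         # form (1), 1925
    MIXED_HIGH_QUALITY = 2                 # form (2), 1930
    MIXED_VARIABLE_QUALITY = 3             # form (3), 1930
    MIXED_VARIABLE_WITH_VERIFICATION = 4   # form (4), 1930 or 1940
    DTMF = 5                               # form (5), 1940


def choose_interaction(has_statistics: bool,
                       hsr_capacity_ok: bool,
                       load: float,
                       app_adapts_to_quality: bool,
                       high_load: float = 0.7) -> Interaction:
    """Pick an interaction form; `high_load` is a made-up threshold."""
    if not has_statistics:
        # Without recognition statistics no ASR automation is used;
        # DTMF is the fallback when HSR capacity is insufficient.
        return Interaction.HUMAN_ONLY if hsr_capacity_ok else Interaction.DTMF
    if load < high_load:
        return Interaction.MIXED_HIGH_QUALITY
    if app_adapts_to_quality:
        # The application adds verification prompts at lower confidence.
        return Interaction.MIXED_VARIABLE_WITH_VERIFICATION
    return Interaction.MIXED_VARIABLE_QUALITY


print(choose_interaction(False, False, 0.9, False).name)
```

In this sketch the application's own capabilities (whether it can add verification prompts) distinguish forms (3) and (4), which is how the text separates them.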

FIG. 20 shows the flow of ASR and HSR processing using statistics, including logic and components as described primarily with respect to FIGS. 18 and 19. Note that FIGS. 21 and 22 are optional concurrent flows. FIG. 20 uses a recognition decision engine 2000 and a result determination engine 2020, which use statistics 1820 together with system status information 1815 and optionally accelerate the recognition media (audio, video) (2010) to shorten the failover time between automation and human assistance.

FIG. 21 shows an optional parallel flow in which timer statistics are combined with recognition and system status 1815 in the recognition decision engine 2100. If the media runs longer than what can normally be recognized successfully (a limit that can be adjusted according to system load), a timer event fires and recognition moves to human assistance 1860. The result determination engine 2150 operates as described above.

FIG. 22 shows an optional parallel flow in which the recognition decision engine 2200 makes a recognition prediction on a portion of the media, with the prediction confidence adjusted according to system load; if recognition is not sufficiently successful, recognition moves to human assistance 1860. The result determination engine 2250 again operates as described above.

FIG. 23 shows a tuning subsystem/flow that collects data about media and its meaning in order to build optimal grammars and classifiers for extracting meaning, and also to build optimal recognition predictors. FIG. 23 describes a use case in which ASR 2310 and classifier automation are applied to a selected subset 2320 of the prompts in an application. The set of application prompts falls into various categories, some of which are obvious candidates for automation and some of which are difficult to automate. For example, yes/no prompts and limited-option prompts usually elicit a very limited repertoire of user utterances and very few intent labels. Evaluating and modeling these types of prompts requires relatively little data, whether for ASR grammars, statistical language models, or machine learning classifiers 2340. Open-ended prompts, on the other hand, are harder, because users can produce a far less constrained set of utterances. These can be augmented with both general and domain-specific knowledge bases 2330. These types of prompts require relatively more data. Even with a large amount of data, there may still be too much variety to produce reliable models for every type of utterance or intent label. In other words, in these cases automation proceeds by establishing the linguistic, categorical, and statistical characteristics of the prompts and using those characteristics to drive prompt selection and formulation. This involves a set of interrelated tasks, as follows.

- Classify utterances into categories based on their characteristics.
- Identify candidate prompts in a given application whose characteristics make them suitable for ASR and classifier automation.
- Determine predictors of early recognition success or failure.
- For each prompt, create, tune, and store acoustic and language models for ASR of the utterances that the prompt elicits, and a classifier model for the target intents of the prompt.
- Determine selection criteria for deciding when to use, or trade off between, ASR and classifier automation and human intent analysis.

FIG. 24 is an example of how timer statistics can be calculated, together with an example of a very simple predictor that splits a North American telephone number into multiple recognition components. Elements 2401 through 2403 represent statistics gathered when a particular question (prompt) was successfully recognized. Element 2401 represents utterances of 2 seconds or less. Utterances of this length make up 15% of all utterances for which statistics exist in this example. ASR was determined to be 90% successful on utterances of 2 seconds or less. Element 2402 represents utterances longer than 2 seconds and up to 3 seconds; ASR recognition succeeds 75% of the time, and 25% of utterances fall into this group. Element 2403 represents utterances longer than 3 seconds and up to 4 seconds. This is an example of a use case that can be affected by system status. If sufficient HSR resources are available, a timer can be set to interrupt recognition at 3 seconds (2402), and ASR successfully recognizes 32.3% of utterances. If system load increases, the timer can be adjusted to 4 seconds (2403), and 44.3% are recognized. Note that under very high load the ASR proxy can decide not to use a timer at all. This makes the speaker wait longer, but as a result up to 55.3% are successfully recognized.
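The 32.3% figure follows from the two buckets whose statistics are given: 0.15 × 0.90 + 0.25 × 0.75 = 0.3225. The Python sketch below reproduces that sum and pairs it with a timer policy. The bucket table uses only the values stated for elements 2401 and 2402 (the share and success rate for element 2403 are not given, so it is left out), and the load thresholds in `pick_timer` are hypothetical values chosen purely for illustration.

```python
from typing import Optional

# (upper bound in seconds, share of all utterances, ASR success rate)
BUCKETS = [
    (2.0, 0.15, 0.90),  # element 2401
    (3.0, 0.25, 0.75),  # element 2402
]


def recognized_share(timer_s: float) -> float:
    """Share of all utterances ASR recognizes when cut off at timer_s."""
    return sum(share * success
               for upper, share, success in BUCKETS
               if upper <= timer_s)


def pick_timer(load: float) -> Optional[float]:
    """Return the timer in seconds, or None for no timer.

    ASSUMPTION: the 0.5 and 0.9 load thresholds are invented; the text
    only says the timer lengthens as load grows and may be dropped
    entirely under very high load.
    """
    if load < 0.5:
        return 3.0   # ample HSR resources: interrupt ASR early
    if load < 0.9:
        return 4.0   # higher load: give ASR more time
    return None      # very high load: let ASR run to completion


print(round(recognized_share(3.0), 4))
```

The remaining figures in the text (44.3% at 4 seconds, 55.3% with no timer) imply that the 3-to-4-second bucket contributes about 12 percentage points, but its individual share and success rate cannot be recovered from the description.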

Element 2404 represents ASR recognition of the 3-digit area code. Element 2405 represents ASR recognition of the 3-digit area code plus the 3-digit exchange. Element 2406 represents ASR recognition of the entire North American telephone number. For example, if speaking a telephone number takes about 8 seconds, each of steps 2404, 2405, and 2406 takes progressively more time to process the utterance. The first step 2404 takes about 30% of the time (2.4 seconds) and the second step takes 60% of the time (4.8 seconds); if any of the three recognition steps yields a result below the confidence level, recognition moves to human assistance. For example, if the area code is not recognized correctly, HSR can be engaged within 2.4 seconds, while the telephone number is still being spoken, rather than failing only after the entire number has been spoken.
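The staged check on a telephone number can be sketched as a loop that stops at the first low-confidence stage. This is a minimal illustration, assuming per-stage confidence scores are available as plain floats; the function name, the 0.8 threshold used in the example call, and the 1.00 fraction for the final stage are assumptions, while the 30% and 60% fractions and the 8-second utterance come from the text.

```python
from typing import Optional, Sequence

# (stage name, fraction of the utterance heard when the stage completes)
STAGES = [
    ("area code", 0.30),               # element 2404
    ("area code + exchange", 0.60),    # element 2405
    ("full number", 1.00),             # element 2406 (fraction assumed)
]


def hsr_handoff_time(utterance_s: float,
                     confidences: Sequence[float],
                     threshold: float) -> Optional[float]:
    """Seconds into the utterance at which HSR is engaged, or None.

    `confidences` holds one hypothetical ASR confidence per stage.
    """
    for (_, fraction), confidence in zip(STAGES, confidences):
        if confidence < threshold:
            # Fail early: hand over without waiting for later stages.
            return utterance_s * fraction
    return None  # every stage met the threshold; ASR result stands


# Area code misrecognized in an 8-second utterance.
print(hsr_handoff_time(8.0, [0.4, 0.9, 0.9], threshold=0.8))
```

With a failed area code the handoff happens about 2.4 seconds in, and with a failed exchange about 4.8 seconds in, instead of after the full 8 seconds.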

In various embodiments and implementations, this response interpretation can be performed by intent analysts alone (pure HSR), by automated ASR (pure automatic speech recognition and intent classification), or by some combination of ASR and HSR. By using the confidence in the results of ASR automation to determine when ASR is producing reliable results, ASR automation can be traded off against HSR with no loss of quality (or with a controlled loss of quality). This means that combining the two approaches in the proxy processing system can achieve greater throughput than HSR alone and can meet peak demand loads with a smaller team of intent analysts.
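One way to picture the confidence-based trade-off is a threshold that relaxes as human analysts become scarce, so that more ASR results are accepted under load at a controlled cost in quality. The sketch below is an assumption-laden illustration: the linear interpolation, the 0.90 and 0.60 bounds, and the function names are all hypothetical and are not taken from the disclosure.

```python
def confidence_threshold(hsr_available: float,
                         strict: float = 0.90,
                         relaxed: float = 0.60) -> float:
    """Threshold an ASR result must meet to be accepted.

    `hsr_available` is the fraction of analyst capacity that is free
    (0.0 = none, 1.0 = all). ASSUMPTION: linear interpolation between
    two made-up bounds.
    """
    free = min(max(hsr_available, 0.0), 1.0)
    return relaxed + (strict - relaxed) * free


def route(asr_confidence: float, hsr_available: float) -> str:
    """Accept the ASR result or send the utterance to HSR."""
    if asr_confidence >= confidence_threshold(hsr_available):
        return "ASR"
    return "HSR"


# The same 0.75-confidence result is routed differently as capacity changes.
print(route(0.75, hsr_available=1.0), route(0.75, hsr_available=0.0))
```

With all analysts free, a 0.75-confidence result goes to HSR; with none free, the same result is accepted from ASR, which corresponds to the controlled quality loss the text mentions.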

The subject matter above has been described in particular detail with respect to various possible embodiments. Those skilled in the art will appreciate that the subject matter may be practiced in other embodiments. First, the particular naming of components, capitalization of terms, attributes, data structures, or any other programming or structural aspect is not mandatory or significant, and the mechanisms that implement the subject matter or its features may have different names, formats, or protocols. Further, the system may be implemented via a combination of hardware and software, as described, or entirely in hardware elements. Also, the particular division of functionality between the various system components described herein is merely exemplary and not mandatory. Functions performed by a single system component may instead be performed by multiple components, and functions performed by multiple components may instead be performed by a single component.

Some portions of the above present features, process steps, and instructions of the subject matter in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. These operations, while described functionally or logically, may be embodied in software, firmware, or hardware, and when embodied in software, may be downloaded to reside on and be operated from the different platforms used by real-time network operating systems.

Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules or by functional names, without loss of generality.

Unless specifically stated otherwise, or as is apparent from the above discussion, it is to be understood that throughout this description, discussions using terms such as "determining" refer to the actions and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system memories or registers or other such information storage, transmission, or display devices.

The subject matter also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored on a computer-readable medium that can be accessed by the computer and executed by a computer processor. Such a computer program may be stored in a non-transitory computer-readable storage medium, such as, but not limited to, any type of disk including floppy (registered trademark) disks, optical disks, CD-ROMs, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, ASICs, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus. Furthermore, the computers referred to herein may include a single processor or may be architectures employing multiple-processor designs for increased computing capability.

Also, the subject matter is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the subject matter described herein, and that any references to specific languages are provided for enablement and best mode of the subject matter.

The subject matter is well suited to a wide variety of computer network systems over numerous topologies. Within this field, the configuration and management of large networks comprise storage devices and computers that are communicatively coupled to dissimilar computers and storage devices over a network, such as the Internet.

Finally, it should be noted that the language used herein has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the subject matter. Accordingly, the disclosure herein is intended to be illustrative, but not limiting, of the scope of the subject matter.

Claims

A computer-implemented system for processing an interaction, the interaction comprising an utterance that requires recognition before it can be used for further computer-execution processing, the system comprising:
An application configured to provide the utterance, wherein the utterance is received from a customer device over a computer network;
A recognition decision engine configured to receive the utterance for recognition, the recognition decision engine using parameters provided by the application to dynamically select, before recognizing the interaction, one or more recognizers from among an automatic speech recognition (ASR) subsystem and a human speech recognition (HSR) subsystem that is different from the automatic speech recognition subsystem and that communicates over a computer network with a device remote from the computer-implemented system;
A result determination engine coupled to the one or more recognizers and configured to provide recognition results.

The system of claim 1, further comprising a system status subsystem operatively connected to the recognition decision engine, wherein the recognition decision engine takes system load information from the system status subsystem as input for use in the dynamic selection.

The system of claim 1, wherein a subset of the one or more recognizers is configured to provide a confidence metric to the recognition decision engine, the recognition decision engine using the confidence metric in the dynamic selection.

The system of claim 3, wherein the confidence metric includes a threshold, and the threshold varies based on resource availability.

The system of claim 1, wherein the recognition decision engine is configured to select the automatic speech recognition subsystem over the human speech recognition subsystem based on a recognition cost factor.

The system of claim 1, wherein the recognition decision engine is configured to select the automatic speech recognition subsystem over the human speech recognition subsystem based on a human resource availability factor.

The system of claim 1, wherein the result determination engine is configured to update a confidence threshold associated with a first recognizer subsystem of the recognizers in response to a match between results from the first recognizer subsystem and a second recognizer subsystem of the recognizers.

The system of claim 1, wherein the recognition decision engine is configured to initially select a first recognizer subsystem of the automatic speech recognition subsystem and, in response to a first result provided by the first recognizer subsystem, to make a subsequent selection of a second recognizer subsystem of the recognizers, the subsequent selection being made before processing of the utterance is completed by the first recognizer subsystem.

A computer-implemented method executed by a computer system to process an interaction, wherein the interaction includes an utterance that requires recognition before it can be used for further computer-execution processing, the computer-implemented method comprising:
Receiving data representing an utterance from a computer application, the utterance being received from a customer device over a computer network;
Using parameters provided by the application, dynamically selecting, before recognizing the interaction, one or more recognizers from among an automatic speech recognizer (ASR) and a human speech recognition (HSR) recognizer that is different from the automatic speech recognizer and that communicates over a computer network with a device remote from the computer system;
Providing a recognition result in response to a result of processing by the one or more recognizers.

The computer-implemented method of claim 9, wherein the dynamically selecting is responsive to a system load metric.

The computer-implemented method of claim 9, wherein the dynamically selecting is responsive to a confidence metric.

The computer-implemented method of claim 11, wherein the confidence metric includes a threshold, and the threshold varies based on resource availability.

The computer-implemented method of claim 9, wherein the dynamically selecting selects the automatic speech recognizer over the human speech recognition recognizer based on a recognition cost factor.

The computer-implemented method of claim 9, wherein the dynamically selecting selects the automatic speech recognizer over the human speech recognition recognizer based on a human resource availability factor.

The computer-implemented method of claim 9, further comprising updating a confidence threshold associated with a first recognizer of the recognizers in response to a match between results from the first recognizer and a second recognizer of the recognizers.

The computer-implemented method of claim 9, further comprising initially selecting a first recognizer of the automatic speech recognizers and, in response to a first result provided by the first recognizer, making a subsequent selection of a second recognizer of the recognizers, wherein the subsequent selection is made before processing of the utterance is completed by the first recognizer.

A non-transitory computer readable storage medium storing executable computer program code for processing an interaction, wherein the interaction includes an utterance that requires recognition before it can be used for further computer-execution processing, the computer program code, when executed by a computer system, causing the computer system to perform:
Receiving data representing an utterance from a computer application, the utterance being received from a customer device over a computer network;
Using parameters provided by the application, dynamically selecting, before recognizing the interaction, one or more recognizers from among an automatic speech recognizer (ASR) and a human speech recognition (HSR) recognizer that is different from the automatic speech recognizer and that communicates over a computer network with a device remote from the computer system;
Providing a recognition result in response to a result of processing by the one or more recognizers.

The non-transitory computer readable storage medium of claim 17, wherein the dynamic selection is responsive to a system load metric.

The non-transitory computer readable storage medium of claim 17, wherein the dynamic selection is responsive to a confidence metric.

The non-transitory computer readable storage medium of claim 19, wherein the confidence metric includes a threshold, and the threshold varies based on resource availability.