JP7497384B2

JP7497384B2 - Text conversion support device and text conversion support method

Info

Publication number: JP7497384B2
Application number: JP2022053293A
Authority: JP
Inventors: 義明山添; 稜松本; 洋輔谷澤; 雪城高橋
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2022-03-29
Filing date: 2022-03-29
Publication date: 2024-06-10
Anticipated expiration: 2042-03-29
Also published as: JP2023146216A

Description

本発明は、テキスト化支援装置及びテキスト化支援方法に関するものである。 The present invention relates to a text conversion assistance device and a text conversion assistance method.

営業員やコールセンタ等における通話内容が、コンプライアンス等の観点に照らして適切か確認するニーズが存在する。また近年では、そうした通話内容の録音データを聞き直して確認するといった旧来手法ではなく、当該音声データのテキスト化を行った上で確認対象とする手法も提案されている。
そうしたテキスト化に関連する従来技術としては、商談や営業活動の際の顧客への説明内容等のデータに基づいて、「禁止表現」の有無、および「必要事項」が含まれているか否かのいずれについてもチェック対象とするコンプライアンスチェックシステムおよびコンプライアンスチェックプログラム（特許文献１参照）などが提案されている。 There is a need to check whether the contents of phone calls made by salespeople or call centers are appropriate from the perspective of compliance, etc. In recent years, instead of the traditional method of listening to and checking the recorded data of such phone calls, a method has been proposed in which the voice data is converted into text and then checked.
Examples of prior art related to such text conversion include a compliance check system and compliance check program (see Patent Document 1) that checks whether or not "prohibited expressions" are present and whether or not "required information" is included, based on data such as the content of explanations given to customers during business negotiations and sales activities.

この技術は、業担当者が顧客に対して行った各発話についてコンプライアンスを遵守しているかをチェックするコンプライアンスチェックシステムであって、前記営業担当者の前記各発話の内容を音声認識技術によりテキスト化したテキストデータに対して、形態素解析を含む自然言語解析処理を行って解析済テキストデータとして出力するテキスト解析部と、前記各発話に係る前記解析済テキストデータ内の各発話について、所定の基準に従って連続する１つ以上の発話からなるブロックにまとめ、前記各ブロックにおいて、顧客に対して説明するべき必要事項として予め定義された第１のテキストデータの内容が説明されているか否かを判定する判定部と、前記各発話に係る前記解析済テキストデータについて、顧客に対して述べてはいけない禁止表現の内容として予め定義された第２のテキストデータにマッチするものがある場合に、対象の前記発話において対象の前記禁止表現が述べられたものと判定するキーワードマッチング部と、前記営業担当者が前記顧客に対して行った前記各発話のデータに前記営業担当者および／または前記顧客を特定する管理情報と関連付けて記録するデータ記録部と、を有し、前記テキスト解析部は、前記営業担当者が前記顧客に対して行った前記各発話のデータに、前記管理情報に基づいて抽出される前記営業担当者が前記顧客に対して行った過去の発話についても含め、前記判定部は、前記ブロックにおいて、前記第１のテキストデータの内容が説明されていると判定した場合に、前記ブロックに対して前記必要事項のカテゴリを付与して記録するとともに、前記必要事項のそれぞれについて、予め設定した所定の評価基準に基づいて、説明された度合を判定するシステムである。 This technology is a compliance check system that checks whether each utterance made by a sales representative to a customer observes compliance. It includes: a text analysis unit that performs natural language analysis processing, including morphological analysis, on text data obtained by converting the content of each utterance of the sales representative into text using speech recognition technology, and outputs the result as analyzed text data; a determination unit that groups the utterances in the analyzed text data into blocks of one or more consecutive utterances according to a predetermined criterion, and determines, for each block, whether the content of first text data, predefined as necessary matters that should be explained to the customer, has been explained; a keyword matching unit that, when the analyzed text data relating to an utterance matches second text data, predefined as the content of prohibited expressions that must not be said to customers, determines that the prohibited expression was stated in that utterance; and a data recording unit that records the data of each utterance made by the sales representative to the customer in association with management information that identifies the sales representative and/or the customer.
The text analysis unit also includes, in the data of each utterance made by the sales representative to the customer, past utterances made by the sales representative to the customer that are extracted based on the management information. When the determination unit determines that the content of the first text data is explained in a block, it assigns the category of the necessary matter to the block and records it, and determines, for each of the necessary matters, the degree to which it has been explained based on a predetermined evaluation criterion.

特開２０１８－１２０６４０号公報 JP 2018-120640 A

上述のようなテキスト化については、深層学習技術等の進展によって精度向上が図られてきおり、その利活用が進んでいる。例えば、金融分野における通話録音データの利活用の一例として、ＮＧワードの発言有無、正しい顧客名、商品名の発音有無をチェックするといったものがある。
当該チェックに際しては、通話録音データをテキスト化したものに対して、キーワードマッチングを行うケースが多い。ところが、録音状況や発話者の癖などの要因により、テキスト化の精度が低くなりやすい通話（誤検知が多い通話）の存在も判明しており、こうした通話に関して、精度良くキーワードマッチングを行うことは困難であった。 The accuracy of the above-mentioned text conversion has been improved through advances in deep learning technology and the like, and its utilization is progressing. For example, one use of recorded call data in the financial field is to check whether inappropriate (NG) words were spoken, and whether the correct customer name and product name were spoken.
In many cases, this check is performed by keyword matching against the text obtained from the recorded call data. However, it has been found that, due to factors such as the recording conditions and the speaker's habits, there are calls for which the accuracy of text conversion tends to be low (calls with many false detections), and it has been difficult to perform accurate keyword matching on such calls.

つまり、音声テキスト化の精度が低くなりがちな通話に関してキーワードマッチングを行うとしても、その精度は期待出来ず、結局のところチェック漏れが発生してしまう要因となっている。 In other words, even if keyword matching is performed on calls for which the accuracy of speech-to-text conversion tends to be low, accurate matching cannot be expected, and this ultimately causes items to be missed in the check.

そこで本発明の目的は、通話内容に関する音声テキスト化の特性に関わらず、当該通話内容のキーワードマッチングを精度良好に実施可能とする技術を提供することにある。 The object of the present invention is to provide a technology that enables accurate keyword matching of the contents of a call, regardless of the characteristics of the speech-to-text conversion of the contents of the call.

上記課題を解決する本発明のテキスト化支援装置は、会話の場面ないし対象ごとに出現が想定される各語彙の正しい音素の情報を規定したマスタデータと、母音間の発話類似度を規定した情報とを保持する記憶装置と、所定装置から得た通話録音データを音響モデルに適用して音素を抽出する処理と、前記マスタデータで音素が規定された語彙のうち、前記通話録音データの会話の場面ないし対象に関して出現が想定されている語彙の前記正しい音素と、前記抽出した音素との一致率を算定するに際し、前記正しい音素及び前記抽出した音素のそれぞれに含まれる母音間の一致率を、前記発話類似度の情報に基づいて算定し、前記正しい音素及び前記抽出した音素のそれぞれに含まれる子音間の一致率を算定し、前記母音間の一致率を前記子音間の一致率よりも優位に重み付けて、前記母音間及び前記子音間の各一致率に基づき前記音素同士の一致率を算定する処理と、前記算定の結果、音素同士が所定の一致率を示す前記語彙をキーワードマッチング結果として特定する処理を実行する演算装置と、を含むことを特徴とする。
また、本発明のテキスト化支援方法は、情報処理装置が、会話の場面ないし対象ごとに出現が想定される各語彙の正しい音素の情報を規定したマスタデータと、母音間の発話類似度を規定した情報とを記憶装置にて保持し、所定装置から得た通話録音データを音響モデルに適用して音素を抽出する処理と、前記マスタデータで音素が規定された語彙のうち、前記通話録音データの会話の場面ないし対象に関して出現が想定されている語彙の前記正しい音素と、前記抽出した音素との一致率を算定するに際し、前記正しい音素及び前記抽出した音素のそれぞれに含まれる母音間の一致率を、前記発話類似度の情報に基づいて算定し、前記正しい音素及び前記抽出した音素のそれぞれに含まれる子音間の一致率を算定し、前記母音間の一致率を前記子音間の一致率よりも優位に重み付けて、前記母音間及び前記子音間の各一致率に基づき前記音素同士の一致率を算定する処理と、前記算定の結果、音素同士が所定の一致率を示す前記語彙をキーワードマッチング結果として特定する処理、を実行することを特徴とする。
The text conversion assistance device of the present invention that solves the above problem is characterized by including: a storage device that holds master data defining the correct phonemes of each vocabulary item expected to appear for each conversation scene or subject, and information defining the speech similarity between vowels; and a calculation device that executes a process of applying call recording data obtained from a predetermined device to an acoustic model to extract phonemes, a process of calculating a match rate between the extracted phonemes and the correct phonemes of those vocabulary items, among the vocabulary items whose phonemes are defined in the master data, that are expected to appear for the conversation scene or subject of the call recording data, and a process of identifying, as a keyword matching result, the vocabulary item whose phonemes show a predetermined match rate as a result of the calculation. In calculating the match rate, the match rate between the vowels contained in the correct phonemes and in the extracted phonemes is calculated based on the speech similarity information, the match rate between the consonants contained in the correct phonemes and in the extracted phonemes is calculated, the match rate between the vowels is weighted more heavily than the match rate between the consonants, and the match rate between the phonemes is calculated from the vowel and consonant match rates.
The text conversion assistance method of the present invention is characterized in that an information processing device holds, in a storage device, master data defining the correct phonemes of each vocabulary item expected to appear for each conversation scene or subject, and information defining the speech similarity between vowels, and executes a process of applying call recording data obtained from a predetermined device to an acoustic model to extract phonemes, a process of calculating a match rate between the extracted phonemes and the correct phonemes of those vocabulary items, among the vocabulary items whose phonemes are defined in the master data, that are expected to appear for the conversation scene or subject of the call recording data, and a process of identifying, as a keyword matching result, the vocabulary item whose phonemes show a predetermined match rate as a result of the calculation. In calculating the match rate, the match rate between the vowels contained in the correct phonemes and in the extracted phonemes is calculated based on the speech similarity information, the match rate between the consonants contained in the correct phonemes and in the extracted phonemes is calculated, the match rate between the vowels is weighted more heavily than the match rate between the consonants, and the match rate between the phonemes is calculated from the vowel and consonant match rates.
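The weighted calculation summarized above can be sketched roughly as follows. This is only an illustrative sketch, not the claimed implementation: the similarity values, the weight `vowel_weight=0.7`, the function names, the restriction to equal-length sequences, and the treatment of the final "n" as a consonant are all assumptions introduced here.

```python
VOWELS = set("aiueo")

# Hypothetical similarity values between different vowels
# (1.0 = identical, 0.0 = no similarity). Not taken from the patent.
VOWEL_SIMILARITY = {
    frozenset(("i", "e")): 0.5,
    frozenset(("u", "o")): 0.5,
}


def vowel_similarity(a, b):
    """Look up the similarity of two vowels; identical vowels score 1.0."""
    if a == b:
        return 1.0
    return VOWEL_SIMILARITY.get(frozenset((a, b)), 0.0)


def weighted_match_rate(correct, extracted, vowel_weight=0.7):
    """Combine vowel and consonant match rates, weighting vowels more heavily.

    Assumes both sequences have the same length and are compared
    position by position.
    """
    if len(correct) != len(extracted):
        raise ValueError("this sketch only compares equal-length sequences")
    vowel_scores, consonant_scores = [], []
    for c, e in zip(correct, extracted):
        if c in VOWELS:
            # Vowel position: graded score from the similarity table.
            vowel_scores.append(vowel_similarity(c, e) if e in VOWELS else 0.0)
        else:
            # Consonant position: exact match only.
            consonant_scores.append(1.0 if c == e else 0.0)
    vowel_rate = sum(vowel_scores) / len(vowel_scores) if vowel_scores else 0.0
    consonant_rate = (
        sum(consonant_scores) / len(consonant_scores) if consonant_scores else 0.0
    )
    return vowel_weight * vowel_rate + (1.0 - vowel_weight) * consonant_rate


correct = "s a e k i s a n".split()
extracted = "t a i k i s a n".split()
rate = weighted_match_rate(correct, extracted)
print(round(rate, 4))
```

With these assumed values, the vowel match rate is 3.5/4 and the consonant match rate is 3/4, so a vowel that is merely similar ("e" heard as "i") lowers the score less than a plain mismatch would.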

本発明によれば、通話内容に関する音声テキスト化の特性に関わらず、当該通話内容のキーワードマッチングを精度良好に実施可能となる。 According to the present invention, it is possible to perform keyword matching of the contents of a call with high accuracy, regardless of the characteristics of the speech-to-text conversion of the contents of the call.

本実施形態のテキスト化支援装置を含むネットワーク構成図である。 FIG. 1 is a network configuration diagram including the text conversion assistance device according to the present embodiment.
本実施形態におけるテキスト化支援装置のハードウェア構成例を示す図である。 FIG. 2 is a diagram illustrating an example of a hardware configuration of the text conversion assistance device according to the present embodiment.
本実施形態におけるオペレータ端末のハードウェア構成例を示す図である。 FIG. 3 is a diagram illustrating an example of a hardware configuration of the operator terminal according to the present embodiment.
本実施形態におけるコールセンタシステムのハードウェア構成例を示す図である。 FIG. 4 is a diagram illustrating an example of a hardware configuration of the call center system according to the present embodiment.
本実施形態における管理者端末のハードウェア構成例を示す図である。 FIG. 5 is a diagram illustrating an example of a hardware configuration of the administrator terminal according to the present embodiment.
本実施形態の通話録音ＤＢの構成例を示す図である。 FIG. 6 is a diagram showing an example of the configuration of the call recording DB according to the present embodiment.
本実施形態の音素マスタテーブルの構成例を示す図である。 FIG. 7 is a diagram showing an example of the configuration of the phoneme master table according to the present embodiment.
本実施形態の発話類似度テーブルの構成例を示す図である。 FIG. 8 is a diagram showing an example of the configuration of the speech similarity table according to the present embodiment.
本実施形態におけるテキスト化支援方法のフロー例１を示す図である。 FIG. 9 is a diagram showing flow example 1 of the text conversion assistance method according to the present embodiment.
本実施形態におけるテキスト化支援方法のフロー例２を示す図である。 FIG. 10 is a diagram showing flow example 2 of the text conversion assistance method according to the present embodiment.
本実施形態におけるテキスト化支援方法のフロー例３を示す図である。 FIG. 11 is a diagram showing flow example 3 of the text conversion assistance method according to the present embodiment.

＜ネットワーク構成＞
以下に本発明の実施形態について図面を用いて詳細に説明する。図１は、本実施形態のテキスト化支援装置１００を含むネットワーク構成図である。図１に示すテキスト化支援装置１００は、通話内容に関する音声テキスト化の特性に関わらず、当該通話内容のキーワードマッチングを精度良好に実施可能とするコンピュータである。 <Network Configuration>
An embodiment of the present invention will be described in detail below with reference to the drawings. Fig. 1 is a network configuration diagram including a text conversion assistance device 100 according to the present embodiment. The text conversion assistance device 100 shown in Fig. 1 is a computer that can perform keyword matching of the contents of a telephone call with good accuracy, regardless of the characteristics of the speech-to-text conversion of the contents of the telephone call.

本実施形態のテキスト化支援装置１００は、図１で示すように、インターネットや組織内のセキュアな回線などの適宜なネットワーク１を介して、オペレータ端末２００、コールセンタシステム３００、及び管理者端末４００と、必要に応じて通信可能に接続されている。よって、これらを総称してテキスト化システム１０としてもよい。 As shown in FIG. 1, the text conversion assistance device 100 of this embodiment is communicatively connected to an operator terminal 200, a call center system 300, and an administrator terminal 400 as necessary via an appropriate network 1, such as the Internet or a secure line within an organization. Therefore, these may be collectively referred to as the text conversion system 10.

本実施形態のテキスト化支援装置１００は、例えば、コールセンタでのオペレータと顧客との会話内容がコンプライアンスや顧客対応の観点で適切であったか、会話中でのＮＧワードの出現や、或いは必須ワードの不出現といった事象についてキーワードマッチングで特定する支援装置と言える。 The text conversion support device 100 of this embodiment can be said to be a support device that uses keyword matching to identify events such as whether the content of a conversation between an operator and a customer at a call center was appropriate from the standpoint of compliance or customer service, or whether inappropriate words appeared in the conversation or whether essential words did not appear.

勿論、コールセンタ業務におけるオペレータと顧客との会話に関してキーワードマッチングを行う状況のみを本発明の適用対象とするのみならず、音声データ中に必要な／禁忌のキーワードの出現状況を検証する機会が存在する業務等であれば、いずれについても適用可能である。 Of course, the present invention is not limited to only situations where keyword matching is performed in conversations between operators and customers in call center operations, but can also be applied to any operations where there is an opportunity to verify the occurrence of necessary/prohibited keywords in voice data.

一方、オペレータ端末２００は、種々の商品やサービスに関する顧客からの問合せへの対応業務、或いは見込み客等に対する電話営業を行う担当者が使用する端末である。具体的には、ＰＣと一体となった電話端末、スマートフォン、タブレット端末、パーソナルコンピュータなどを想定できる。こうしたオペレータ端末２００での担当者と顧客との間の会話が録音され、通話録音データとして管理、活用されることとなる。 On the other hand, the operator terminal 200 is a terminal used by a person in charge of responding to customer inquiries regarding various products and services, or conducting telephone sales to potential customers, etc. Specifically, it can be a telephone terminal integrated with a PC, a smartphone, a tablet terminal, a personal computer, etc. Conversations between the person in charge and the customer on such an operator terminal 200 are recorded, and are managed and utilized as call recording data.

また、コールセンタシステム３００は、上述のオペレータ端末２００と顧客の電話機との間で発着信の管理や、上述のオペレータ端末２００での会話内容である通話録音データを管理するシステムとなる。よって、コールセンタシステム３００は、通話録音データを記憶装置にて保持・管理し、テキスト化支援装置１００に適宜配信する。 The call center system 300 also manages incoming and outgoing calls between the operator terminal 200 and the customer's telephone, and manages recorded call data, which is the content of conversations at the operator terminal 200. Therefore, the call center system 300 stores and manages the recorded call data in a storage device, and distributes it to the text conversion support device 100 as appropriate.

また、管理者端末４００は、上述のコールセンタの管理者が操作する端末である。この管理者端末４００は、当該コールセンタでの業務終了時など適宜なタイミングで、一日など所定期間分の通話録音データに関して、上述のコンプライアンス等の所定観点でのチェックを行うべくキーワードマッチング処理の指示を、テキスト化支援装置１００に行い、その処理結果を取得する端末となる。
＜ハードウェア構成＞
また、本実施形態のテキスト化支援装置１００のハードウェア構成は、図２に以下の如くとなる。 The administrator terminal 400 is a terminal operated by the administrator of the call center. The administrator terminal 400 instructs the text conversion assistance device 100 to perform keyword matching processing at an appropriate timing, such as when the business hours at the call center are finished, to check the call recording data for a predetermined period, such as one day, from a predetermined viewpoint, such as the above-mentioned compliance, and obtains the processing result.
<Hardware Configuration>
The hardware configuration of the text conversion assistance device 100 of this embodiment is as shown in FIG. 2.

すなわちテキスト化支援装置１００は、記憶装置１０１、メモリ１０３、演算装置１０４、および通信装置１０５、を備える。 That is, the text conversion assistance device 100 includes a storage device 101, a memory 103, a computing device 104, and a communication device 105.

このうち記憶装置１０１は、ＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）やハードディスクドライブなど適宜な不揮発性記憶素子で構成される。 Of these, the storage device 101 is composed of an appropriate non-volatile storage element such as an SSD (Solid State Drive) or a hard disk drive.

また、メモリ１０３は、ＲＡＭなど揮発性記憶素子で構成される。 In addition, memory 103 is composed of a volatile memory element such as RAM.

また、演算装置１０４は、記憶装置１０１に保持されるプログラム１０２をメモリ１０３に読み出すなどして実行し装置自体の統括制御を行なうとともに各種判定、演算及び制御処理を行なうＣＰＵである。 The calculation device 104 is a CPU that executes the program 102 stored in the storage device 101 by reading it into the memory 103, etc., and performs overall control of the device itself as well as various judgments, calculations, and control processes.

また、通信装置１０５は、ネットワーク１と接続して、少なくともコールセンタシステム３００との通信処理を担うネットワークインターフェイスカード等を想定する。 The communication device 105 is assumed to be a network interface card or the like that is connected to the network 1 and handles at least communication processing with the call center system 300.

なお、テキスト化支援装置１００がスタンドアロンマシンである場合、ユーザからのキー入力や音声入力を受け付ける入力装置、処理データの表示を行うディスプレイ等の出力装置、を更に備えるとすれば好適である。 If the text conversion assistance device 100 is a standalone machine, it is preferable for it to further include an input device that accepts key input or voice input from the user, and an output device such as a display that displays the processed data.

また、記憶装置１０１内には、本実施形態のテキスト化支援装置として必要な機能を実装する為のプログラム１０２に加えて、通話録音ＤＢ１２５、音素マスタテーブル１２６、及び発話類似度テーブル１２７が少なくとも記憶されている。ただし、これらデータベース等についての詳細は後述する。 In addition to the program 102 for implementing the functions required for the text conversion assistance device of this embodiment, at least a call recording DB 125, a phoneme master table 126, and a speech similarity table 127 are stored in the storage device 101. Details of these databases and tables will be described later.

また、プログラム１０２は、音響モデル１１０、及び言語モデル１１１を備えるものとする。音響モデル１１０は、オペレータと顧客との間の会話に関する通話録音データから当該通話の音声を構成する音素を抽出する機能である。 The program 102 also includes an acoustic model 110 and a language model 111. The acoustic model 110 is a function for extracting phonemes that constitute the voice of a conversation between an operator and a customer from call recording data.

そのため、テキスト化支援装置１００は、通話録音データが示す音声の特徴量（周波数や音の強弱）を分析し、取扱いしやすいデータとして変換する音響分析を事前に実行し、この音響分析結果が示す特徴量を音響モデル１１０に与えることになる。 Therefore, the text conversion assistance device 100 performs an acoustic analysis in advance to analyze the voice features (frequency and volume) indicated by the call recording data and convert it into data that is easy to handle, and provides the features indicated by the results of this acoustic analysis to the acoustic model 110.

音響モデル１１０は、適宜な深層学習などにより、上述の特徴量と音素との対応関係を規定したモデルであって、上述の音声の特徴量を与えることで、音波の最小単位である音素を抽出する。 The acoustic model 110 is a model that defines the correspondence between the above-mentioned features and phonemes through appropriate deep learning, etc., and extracts phonemes, which are the smallest units of sound waves, by providing the above-mentioned voice features.

なお、音素とは、音声を発したときに観測できる音波の最小構成要素である。日本語における音素は、母音（アイウエオ）、擬音（ン）、子音（２３種類）の計３種類から成り立っている。例えば、「田中さん」の場合は、「t-a-n-a-k-a-s-a-n」が音素となる。 A phoneme is the smallest component of a sound wave that can be observed when speech is produced. Japanese phonemes fall into three types: vowels (a, i, u, e, o), the moraic nasal (n), and consonants (23 kinds). For example, the phonemes for "Tanaka-san" are "t-a-n-a-k-a-s-a-n."

本実施形態のテキスト化支援装置１００は、音響モデル１１０により得た音素に基づいて、キーワードマッチングを行うこととなる。上述の場合、音素「t-a-n-a-k-a-s-a-n」を、「田中さん」という日本語の語彙として特定する処理が該当する。より具体的には、各音素がどの単語に該当するか、音素マスタテーブル１２６を適宜利用しつつ、本発明のテキスト化支援方法を適用することで、音素を語彙に置換していく。 The text conversion assistance device 100 of this embodiment performs keyword matching based on the phonemes obtained by the acoustic model 110. In the above example, this corresponds to the process of identifying the phoneme sequence "t-a-n-a-k-a-s-a-n" as the Japanese vocabulary item "Tanaka-san." More specifically, the phonemes are replaced with vocabulary by applying the text conversion assistance method of the present invention while using the phoneme master table 126 as appropriate to determine which word each phoneme sequence corresponds to.

一方、言語モデル１１１は、キーワードマッチングで得た語彙の群れを適宜に文章化する処理を担うものとなる。例えば、「田中さん」、「信州では」、「雪が」、「積もりましたよ」、といった語彙の群れを、語彙の群れと正しい（或いは高頻度で出現する）一文との関係についての統計データ等に基づいて、可能性の高い組み合わせ例として意味ある文章を構成する。 On the other hand, the language model 111 is responsible for the process of appropriately converting the vocabulary clusters obtained by keyword matching into sentences. For example, it converts vocabulary clusters such as "Mr. Tanaka," "In Shinshu," "Snow," and "It's piled up" into meaningful sentences as examples of highly probable combinations based on statistical data about the relationship between the vocabulary clusters and correct (or frequently occurring) sentences.

また、本実施形態のオペレータ端末２００のハードウェア構成は、図３に以下の如くとなる。 The hardware configuration of the operator terminal 200 in this embodiment is as shown in Figure 3.

すなわちオペレータ端末２００は、記憶装置２０１、メモリ２０３、演算装置２０４、入力装置２０５、出力装置２０６、および通信装置２０７、を備える。 That is, the operator terminal 200 includes a storage device 201, a memory 203, a computing device 204, an input device 205, an output device 206, and a communication device 207.

このうち記憶装置２０１は、ＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）やハードディスクドライブなど適宜な不揮発性記憶素子で構成される。 Of these, the storage device 201 is composed of an appropriate non-volatile storage element such as an SSD (Solid State Drive) or a hard disk drive.

また、メモリ２０３は、ＲＡＭなど揮発性記憶素子で構成される。 In addition, memory 203 is composed of a volatile memory element such as RAM.

また、演算装置２０４は、記憶装置２０１に保持されるプログラム２０２をメモリ２０３に読み出すなどして実行し装置自体の統括制御を行なうとともに各種判定、演算及び制御処理を行なうＣＰＵである。 The calculation device 204 is a CPU that executes the program 202 stored in the storage device 201 by reading it into the memory 203, etc., to perform overall control of the device itself, as well as various judgments, calculations, and control processes.

また、入力装置２０５は、ユーザたるオペレータからのキー入力や音声入力を受け付けるキーボードやマウスといった装置で構成される。 The input device 205 is also composed of devices such as a keyboard and mouse that accept key input and voice input from the user/operator.

また、出力装置２０６は、演算装置２０４での処理結果の表示を行うディスプレイやスピーカー等の装置で構成される。 The output device 206 is composed of devices such as a display and a speaker that display the processing results of the calculation device 204.

また、通信装置２０７は、ネットワーク１と接続して、コールセンタシステム３００や管理者端末４００（あるいはテキスト化支援装置１００）との通信処理を担うネットワークインターフェイスカード等を想定する。 The communication device 207 is assumed to be a network interface card or the like that is connected to the network 1 and handles communication processing with the call center system 300 and the administrator terminal 400 (or the text conversion assistance device 100).

また、本実施形態のコールセンタシステム３００のハードウェア構成は、図４に以下の如くとなる。 The hardware configuration of the call center system 300 in this embodiment is as shown in Figure 4.

すなわちコールセンタシステム３００は、記憶装置３０１、メモリ３０３、演算装置３０４、および通信装置３０５、を備える。 That is, the call center system 300 includes a storage device 301, a memory 303, a computing device 304, and a communication device 305.

このうち記憶装置３０１は、ＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）やハードディスクドライブなど適宜な不揮発性記憶素子で構成される。 Of these, the storage device 301 is composed of an appropriate non-volatile storage element such as an SSD (Solid State Drive) or a hard disk drive.

また、メモリ３０３は、ＲＡＭなど揮発性記憶素子で構成される。 In addition, memory 303 is composed of a volatile memory element such as RAM.

また、演算装置３０４は、記憶装置３０１に保持されるプログラム３０２をメモリ３０３に読み出すなどして実行し装置自体の統括制御を行なうとともに各種判定、演算及び制御処理を行なうＣＰＵである。 The calculation device 304 is a CPU that executes the program 302 stored in the storage device 301 by reading it into the memory 303, etc., to perform overall control of the device itself, as well as various judgments, calculations, and control processes.

また、通信装置３０５は、ネットワーク１と接続して、少なくともテキスト化支援装置１００や、オペレータ端末２００との通信処理を担うネットワークインターフェイスカード等を想定する。 The communication device 305 is assumed to be a network interface card or the like that is connected to the network 1 and handles communication processing with at least the text conversion assistance device 100 and the operator terminal 200.

なお、コールセンタシステム３００がスタンドアロンマシンである場合、ユーザからのキー入力や音声入力を受け付ける入力装置、処理データの表示を行うディスプレイ等の出力装置、を更に備えるとすれば好適である。 If the call center system 300 is a standalone machine, it is preferable for it to further include an input device that accepts key input or voice input from the user, and an output device such as a display that displays the processed data.

また、記憶装置３０１内には、本実施形態のコールセンタシステム３００として必要な機能を実装する為のプログラム３０２に加えて、通話録音データ３２５が少なくとも記憶されている。この通話録音データ３２５は、テキスト化支援装置１００における通話録音ＤＢ１２５のレコードとなるデータである。 In addition to the program 302 for implementing the functions required for the call center system 300 of this embodiment, at least call recording data 325 is stored in the storage device 301. This call recording data 325 is data that becomes a record in the call recording DB 125 in the text conversion assistance device 100.

また、本実施形態の管理者端末４００のハードウェア構成は、図５に以下の如くとなる。 The hardware configuration of the administrator terminal 400 in this embodiment is as shown in Figure 5.

すなわち管理者端末４００は、記憶装置４０１、メモリ４０３、演算装置４０４、入力装置４０５、出力装置４０６、および通信装置４０７、を備える。 That is, the administrator terminal 400 includes a storage device 401, a memory 403, a computing device 404, an input device 405, an output device 406, and a communication device 407.

このうち記憶装置４０１は、ＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）やハードディスクドライブなど適宜な不揮発性記憶素子で構成される。 Of these, the storage device 401 is composed of an appropriate non-volatile storage element such as an SSD (Solid State Drive) or a hard disk drive.

また、メモリ４０３は、ＲＡＭなど揮発性記憶素子で構成される。 In addition, memory 403 is composed of a volatile memory element such as RAM.

また、演算装置４０４は、記憶装置４０１に保持されるプログラム４０２をメモリ４０３に読み出すなどして実行し装置自体の統括制御を行なうとともに各種判定、演算及び制御処理を行なうＣＰＵである。 The calculation device 404 is a CPU that executes the program 402 stored in the storage device 401 by reading it into the memory 403, etc., to perform overall control of the device itself, as well as various judgments, calculations, and control processes.

また、入力装置４０５は、ユーザたるオペレータからのキー入力や音声入力を受け付けるキーボードやマウスといった装置で構成される。 The input device 405 is composed of devices such as a keyboard and a mouse that accept key input and voice input from the user/operator.

また、出力装置４０６は、演算装置４０４での処理結果の表示を行うディスプレイやスピーカー等の装置で構成される。 The output device 406 is composed of devices such as a display and a speaker that display the processing results of the calculation device 404.

また、通信装置４０７は、ネットワーク１と接続して、テキスト化支援装置１００やコールセンタシステム３００との通信処理を担うネットワークインターフェイスカード等を想定する。
＜データ構造例＞
続いて、本実施形態のテキスト化支援装置１００が用いる各種情報について説明する。図６に、本実施形態における通話録音ＤＢ１２５の一例を示す。本実施形態の通話録音ＤＢ１２５は、例えば、コールセンタシステム３００から（またはオペレータ端末２００から）取得した、オペレータと顧客との間の通話録音データを格納したデータベースである。 The communication device 407 is assumed to be a network interface card or the like that is connected to the network 1 and handles communication processing with the text conversion assistance device 100 and the call center system 300 .
<Data structure example>
Next, various information used by the text conversion assistance device 100 of this embodiment will be described. Fig. 6 shows an example of the call recording DB 125 of this embodiment. The call recording DB 125 of this embodiment is a database that stores call recording data between an operator and a customer, obtained from the call center system 300 (or from the operator terminal 200), for example.

この通話録音ＤＢ１２５は、例えば、通話日時及び通話対象の顧客を示す顧客ＩＤをキーに、当該顧客の氏名、当該顧客から指定された商品・サービス名、対応オペレータのＩＤ、録音データファイル、といったデータを紐付けレコードの集合体となっている。 This call recording DB 125 is a collection of records that, keyed by, for example, the date and time of the call and a customer ID identifying the customer on the call, associate data such as the customer's name, the name of the product or service specified by the customer, the ID of the operator who handled the call, and the recording data file.
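As a rough illustration of the record layout just described, the following sketch models one record and a lookup keyed by call date/time and customer ID. The class name, field names, and all sample values other than the customer ID "C018122" and the name 佐伯 (which appear later in this description) are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallRecord:
    """One record of the call recording DB (field names are assumptions)."""
    call_datetime: datetime
    customer_id: str
    customer_name: str
    product_or_service: str
    operator_id: str
    recording_file: str


# The DB is modeled as a dict keyed by (call date/time, customer ID).
call_recording_db = {}


def store(record):
    """Register a record under its composite key."""
    call_recording_db[(record.call_datetime, record.customer_id)] = record


record = CallRecord(
    call_datetime=datetime(2022, 3, 29, 10, 15),  # hypothetical call time
    customer_id="C018122",
    customer_name="佐伯",
    product_or_service="Example Fund A",          # hypothetical product
    operator_id="OP0001",                         # hypothetical operator
    recording_file="rec_0001.wav",                # hypothetical file name
)
store(record)
```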

また図７に、本実施形態における音素マスタテーブル１２６の構成例を示す。本実施形態の音素マスタテーブル１２６は、語彙ごとの正しい音素を規定したテーブルである。 Figure 7 shows an example of the configuration of the phoneme master table 126 in this embodiment. The phoneme master table 126 in this embodiment is a table that specifies the correct phonemes for each vocabulary.

この音素マスタテーブル１２６は、例えば、会話の場面や対象をキーとして、それら場面や対象に関する会話中に出現が想定される語彙の正しい音素の情報を規定した構成となっている。 This phoneme master table 126 is configured to, for example, use a conversation scene or subject as a key and specify the correct phoneme information for vocabulary that is expected to appear in a conversation related to that scene or subject.
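One way to picture such a table is a mapping from a scene or subject key to the vocabulary expected there, each with its correct phoneme sequence. The sketch below is an assumption about layout only: the key format and the second entry are hypothetical, while the 佐伯さん and 田中さん phoneme sequences are the ones used as examples in this description.

```python
# Scene/subject key -> {vocabulary item: correct phoneme sequence}.
phoneme_master = {
    "customer:C018122": {
        "佐伯さん": ["S", "A", "E", "K", "I", "S", "A", "N"],
    },
    "customer:C000001": {  # hypothetical customer ID
        "田中さん": ["T", "A", "N", "A", "K", "A", "S", "A", "N"],
    },
}


def expected_vocabulary(key):
    """Return the vocabulary expected for a scene or subject (empty if unknown)."""
    return phoneme_master.get(key, {})


for word, phonemes in expected_vocabulary("customer:C018122").items():
    print(word, "-".join(phonemes))
```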

また図８に、本実施形態における発話類似度テーブル１２７の構成例を示す。本実施形態の発話類似度テーブル１２７は、日本語の母音を発話した場合の各間における類似度を規定したテーブルである。 Figure 8 also shows an example of the configuration of the speech similarity table 127 in this embodiment. The speech similarity table 127 in this embodiment is a table that specifies the similarity between each pair of Japanese vowels when they are spoken.

この発話類似度テーブル１２７は、縦横に母音を列挙し、母音それぞれの間での類似度を、最大値１（完全一致）から最小値０（類似度ゼロ）までの間の非連続な数値で規定したマトリクスを構成している。
＜フロー例１＞
以下、本実施形態におけるテキスト化支援方法の実際手順について図に基づき説明する。以下で説明するテキスト化支援方法に対応する各種動作は、テキスト化支援装置１００がメモリ等に読み出して実行するプログラムによって実現される。そして、このプログラムは、以下に説明される各種の動作を行うためのコードから構成されている。 This speech similarity table 127 is a matrix in which the vowels are listed along both the vertical and horizontal axes, and the similarity between each pair of vowels is defined as a discrete value between a maximum of 1 (perfect match) and a minimum of 0 (zero similarity).
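A matrix of this shape could be represented as below. The structure (five vowels on both axes, 1 on the diagonal, discrete values between 0 and 1) follows the description; the individual off-diagonal values are invented for illustration and are not the values of FIG. 8.

```python
VOWELS = ("a", "i", "u", "e", "o")

# Hypothetical similarities for unordered vowel pairs; unlisted pairs are 0.0.
_PAIR_SIMILARITY = {
    frozenset(("i", "e")): 0.5,
    frozenset(("u", "o")): 0.5,
    frozenset(("a", "o")): 0.25,
}

# Full symmetric matrix: SIMILARITY[row][col], with 1.0 on the diagonal.
SIMILARITY = {
    row: {
        col: 1.0 if row == col
        else _PAIR_SIMILARITY.get(frozenset((row, col)), 0.0)
        for col in VOWELS
    }
    for row in VOWELS
}

# Print the matrix with the vowels as row and column headers.
print("  " + "     ".join(VOWELS))
for row in VOWELS:
    print(row, " ".join(f"{SIMILARITY[row][col]:.2f}" for col in VOWELS))
```

Storing unordered pairs and expanding them keeps the matrix symmetric by construction, so the lookup gives the same value regardless of which vowel is the "correct" one.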
<Flow Example 1>
The actual procedure of the text conversion assistance method in this embodiment will be described below with reference to the drawings. The various operations corresponding to the text conversion assistance method described below are realized by a program that is read into a memory or the like and executed by the text conversion assistance device 100. This program is composed of codes for performing the various operations described below.

図９は、本実施形態におけるテキスト化支援方法のフロー例１を示す図である。この場合、テキスト化支援装置１００は、例えば、コールセンタシステム３００（ないしオペレータ端末２００）から、通話録音データ３２５を取得し、これを通話録音ＤＢ１２５に格納する（ｓ１）。 Figure 9 is a diagram showing a flow example 1 of the text conversion assistance method in this embodiment. In this case, the text conversion assistance device 100 acquires call recording data 325, for example, from the call center system 300 (or the operator terminal 200), and stores it in the call recording DB 125 (s1).

また、テキスト化支援装置１００は、予め定めたタイミングの到来を検知して、または管理者端末４００からの指示を受けて、通話録音ＤＢ１２５で保持する通話録音データのうち、例えば、所定期間に関するものを抽出し、これを音響モデル１１０に適用することで、音素を抽出する（ｓ２）。 In addition, the text conversion assistance device 100 detects the arrival of a predetermined timing or receives an instruction from the administrator terminal 400, and extracts, for example, data relating to a predetermined period from the call recording data stored in the call recording DB 125, and applies this to the acoustic model 110 to extract phonemes (s2).

例えば、コールセンタのオペレータが「佐伯」という顧客に対して、定型の挨拶の後、「佐伯さん」という発話を行っていた通話録音データに関して、「Ｔ－Ａ－Ｉ－Ｋ－Ｉ－Ｓ－Ａ－Ｎ」という音素配列を抽出したとする。ここでは顧客氏名を処理対象としたが、これは一例であって、例えば、金融商品名を処理対象とするとしても好適である。 For example, suppose that the phoneme sequence "T-A-I-K-I-S-A-N" is extracted from call recording data in which a call center operator, after a standard greeting, addressed a customer named "Saeki" as "Saeki-san." Here, the customer's name is the subject of processing, but this is just one example, and it would also be appropriate to process, for example, the name of a financial product.

続いて、テキスト化支援装置１００は、上述の通話録音データに紐付く顧客ＩＤから、当該通話対象の顧客が「佐伯」さんであることを特定し、この「佐伯さん」をキーワードマッチング対象の語彙として、その音素を音素マスタテーブル１２６から抽出する（ｓ３）。この場合、「Ｓ－Ａ－Ｅ－Ｋ－Ｉ－Ｓ－Ａ－Ｎ」という音素配列が、音素マスタテーブル１２６における顧客ＩＤ「Ｃ０１８１２２：佐伯＊＊＊」のレコードから抽出される。 Next, the text conversion assistance device 100 identifies the customer of the call as "Saeki" from the customer ID linked to the above-mentioned call recording data, and extracts the phonemes of "Saeki" from the phoneme master table 126 as a vocabulary for keyword matching (s3). In this case, the phoneme sequence "S-A-E-K-I-S-A-N" is extracted from the record of customer ID "C018122: Saeki ***" in the phoneme master table 126.

続いて、テキスト化支援装置１００は、ｓ２、ｓ３でそれぞれ得た音素配列を比較し、その一致率を算定する（ｓ４）。上述の場合、「Ｔ－Ａ－Ｉ－Ｋ－Ｉ－Ｓ－Ａ－Ｎ」という音素配列と、「Ｓ－Ａ－Ｅ－Ｋ－Ｉ－Ｓ－Ａ－Ｎ」という音素配列を照合すると、全８音素のうち、６つの音素が一致しており、６／８＝０．７５が一致率となる。 Next, the text conversion assistance device 100 compares the phoneme sequences obtained in s2 and s3, respectively, and calculates the match rate (s4). In the above example, when the phoneme sequence "T-A-I-K-I-S-A-N" is compared with the phoneme sequence "S-A-E-K-I-S-A-N," six of the eight phonemes match, resulting in a match rate of 6/8 = 0.75.
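The match rate calculation in s4 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `phoneme_match_rate` is hypothetical, and counting positions beyond the shorter sequence as mismatches is an assumption, since the text only shows sequences of equal length.

```python
def phoneme_match_rate(extracted: str, master: str) -> float:
    """Position-wise match rate of two hyphen-separated phoneme sequences."""
    a = extracted.split("-")
    b = master.split("-")
    # Assumption: if the lengths differ, the extra positions count as mismatches.
    total = max(len(a), len(b))
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / total


# Sequence from the call recording (s2) versus the master table entry (s3).
print(phoneme_match_rate("T-A-I-K-I-S-A-N", "S-A-E-K-I-S-A-N"))  # 0.75
```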

もし、従来どおり、通話録音データから得た「Ｔ－Ａ－Ｉ－Ｋ－Ｉ－Ｓ－Ａ－Ｎ」という音素配列を言語モデル１１１に適用し、「大輝さん」というテキストを得て、これと、音素マスタテーブル１２６で規定の語彙「佐伯さん」というテキストと照合した場合、その一致率は、全４文字のうち２文字の一致で、一致率を２／４＝０．５と算定することになる。キーワードマッチングの合否基準が、例えば一致率０．６であると、オペレータとしては確かに「佐伯さん」と顧客名を発話しているにも関わらず、言語モデル１１１での変換精度の影響によって、これらはマッチングしないと判定されることになってしまう。 If, as in the conventional approach, the phoneme sequence "T-A-I-K-I-S-A-N" obtained from the call recording data were applied to the language model 111, yielding the text "Taiki-san" (大輝さん), and this text were compared with the text "Saeki-san" (佐伯さん), the vocabulary entry defined in the phoneme master table 126, only two of the four characters would match, giving a match rate of 2/4 = 0.5. If the pass/fail criterion for keyword matching were, for example, a match rate of 0.6, the two would be judged not to match because of the conversion accuracy of the language model 111, even though the operator did in fact say the customer's name "Saeki-san."
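The contrast between text-level and phoneme-level matching can be sketched as below. The helper `positional_match_rate` and the constant name `THRESHOLD` are hypothetical; the value 0.6 is the example pass/fail criterion given in the text.

```python
THRESHOLD = 0.6  # example pass/fail criterion from the text


def positional_match_rate(a, b) -> float:
    """Fraction of positions at which two sequences hold the same element."""
    total = max(len(a), len(b))
    if total == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / total


# Conventional approach: compare the converted text with the master vocabulary.
text_rate = positional_match_rate("大輝さん", "佐伯さん")
# Phoneme-based approach: compare the phoneme sequences directly.
phoneme_rate = positional_match_rate(
    "T-A-I-K-I-S-A-N".split("-"), "S-A-E-K-I-S-A-N".split("-")
)

print(text_rate, text_rate >= THRESHOLD)        # 0.5 False
print(phoneme_rate, phoneme_rate >= THRESHOLD)  # 0.75 True
```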

一方、本発明のテキスト化支援装置１００によれば、こうした言語モデル１１１での変換精度の問題をクリアし、音素配列間の一致率に基づくキーワードマッチングを行うことが可能であり、従来よりも精度良好なキーワードマッチングが可能となっている。 In contrast, the text conversion assistance device 100 of the present invention avoids this problem of conversion accuracy in the language model 111 by performing keyword matching based on the match rate between phoneme sequences, which enables more accurate keyword matching than before.
＜フロー例２＞
<Flow Example 2>
図１０は、本実施形態におけるテキスト化支援方法のフロー例２を示す図である。ここでは、上述のフロー例１における効果をさらに高めるべく、母音の観点を加えて音素配列の一致度を算定する手法について説明する。なお、本フローにおいては、上述のフロー例１におけるｓ１、ｓ２までは同様であるため、それ以降の処理として説明を行うものとする。 Figure 10 is a diagram showing flow example 2 of the text conversion assistance method in this embodiment. Here, a method of calculating the degree of match between phoneme sequences that additionally takes vowels into account is described, in order to further enhance the effect of flow example 1 above. In this flow, the steps up to s1 and s2 are the same as in flow example 1, so only the subsequent processing is described.

テキスト化支援装置１００は、上述のフロー例１のように抽出した音素配列から母音（ａ、ｉ、ｕ、ｅ、ｏ）だけを抽出する（ｓ１０）。上述の例の場合、「Ａ、Ｉ、Ｉ、Ａ」という母音配列を抽出することになる。 The text conversion assistance device 100 extracts only the vowels (a, i, u, e, o) from the phoneme sequence extracted as in flow example 1 above (s10). In the above example, the vowel sequence "A, I, I, A" is extracted.

また、テキスト化支援装置１００は、上述の通話対象の顧客「佐伯」さんに関する、音素および母音の抽出をｓ３、ｓ１０と同様に実行する（ｓ１１）。この場合、「Ｓ－Ａ－Ｅ－Ｋ－Ｉ－Ｓ－Ａ－Ｎ」という音素配列から、母音配列「Ａ、Ｅ、Ｉ、Ａ」を抽出することになる。 The text conversion assistance device 100 also extracts phonemes and vowels for the above-mentioned customer "Saeki" in the same manner as in s3 and s10 (s11). In this case, the vowel sequence "A, E, I, A" is extracted from the phoneme sequence "S-A-E-K-I-S-A-N."

続いて、テキスト化支援装置１００は、ｓ１０、ｓ１１でそれぞれ得た母音配列における母音を、配列先頭から順に発話類似度テーブル１２７に照合し、母音配列間で対応する位置同士の母音の類似度を特定する（ｓ１２）。 Next, the text conversion assistance device 100 compares the vowels in the vowel sequences obtained in s10 and s11 against the speech similarity table 127, starting from the beginning of the sequence, and identifies the similarity between vowels at corresponding positions in the vowel sequences (s12).

例えば、母音「Ａ」と母音「Ａ」は、発話類似度テーブル１２７によれば類似度「１」、母音「Ａ」と母音「Ｉ」は、発話類似度テーブル１２７によれば類似度「０」、母音「Ａ」と母音「Ｕ」は、発話類似度テーブル１２７によれば類似度「０」、母音「Ａ」と母音「Ｅ」は、発話類似度テーブル１２７によれば類似度「０．５」、母音「Ａ」と母音「Ｏ」は、発話類似度テーブル１２７によれば類似度「０．５」、などと特定する。 For example, according to the speech similarity table 127, the vowel "A" and the vowel "A" have a similarity of "1", the vowel "A" and the vowel "I" have a similarity of "0", the vowel "A" and the vowel "U" have a similarity of "0", the vowel "A" and the vowel "E" have a similarity of "0.5", the vowel "A" and the vowel "O" have a similarity of "0.5", and so on.

その結果、上述の例であれば、「Ａ、Ｉ、Ｉ、Ａ」と「Ａ、Ｅ、Ｉ、Ａ」を照合し、「Ａ」と「Ａ」で類似度「１」、「Ｉ」と「Ｅ」で類似度「０．５」、「Ｉ」と「Ｉ」で類似度「１」、「Ａ」と「Ａ」で類似度「１」、となる。 As a result, in the above example, when "A, I, I, A" is matched with "A, E, I, A", the similarity between "A" and "A" is 1, between "I" and "E" is 0.5, between "I" and "I" is 1, and between "A" and "A" is 1.

そこでテキスト化支援装置１００は、ｓ１２で得た母音ごとの類似度に基づき、上述の音素配列における母音類似度を、（１＋０．５＋１＋１）／４＝０．８７５と算定する（ｓ１３）。 Therefore, based on the similarity for each vowel obtained in s12, the text conversion assistance device 100 calculates the vowel similarity in the above-mentioned phoneme sequence to be (1 + 0.5 + 1 + 1) / 4 = 0.875 (s13).
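Steps s10 to s13 can be sketched as follows. The table below is a hypothetical excerpt of the speech similarity table 127 containing only the values quoted in the text; treating every unlisted pair as similarity 0 is an assumption, as are all function names.

```python
VOWELS = {"A", "I", "U", "E", "O"}

# Hypothetical excerpt of the speech similarity table 127. Only values quoted
# in the text are listed; unlisted pairs default to 0 (an assumption).
SIMILARITY_TABLE = {
    frozenset(("A", "E")): 0.5,
    frozenset(("A", "O")): 0.5,
    frozenset(("I", "E")): 0.5,
}


def similarity(x: str, y: str) -> float:
    """Look up the similarity of two vowels; identical vowels score 1."""
    if x == y:
        return 1.0
    return SIMILARITY_TABLE.get(frozenset((x, y)), 0.0)


def extract_vowels(phonemes: str) -> list[str]:
    """s10/s11: keep only the vowels of a hyphen-separated phoneme sequence."""
    return [p for p in phonemes.split("-") if p in VOWELS]


def vowel_similarity(a: list[str], b: list[str]) -> float:
    """s12/s13: average the position-wise similarity of two vowel sequences."""
    total = max(len(a), len(b))
    if total == 0:
        return 0.0
    return sum(similarity(x, y) for x, y in zip(a, b)) / total


recorded = extract_vowels("T-A-I-K-I-S-A-N")  # ['A', 'I', 'I', 'A']
master = extract_vowels("S-A-E-K-I-S-A-N")    # ['A', 'E', 'I', 'A']
print(vowel_similarity(recorded, master))     # 0.875
```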

また、テキスト化支援装置１００は、ｓ２、ｓ３で得ている音素配列に基づき、子音についても一致率を算定する（ｓ１４）。上述の例の場合、「Ｔ－Ａ－Ｉ－Ｋ－Ｉ－Ｓ－Ａ－Ｎ」という音素配列における子音「Ｔ、Ｋ、Ｓ、Ｎ」と、「Ｓ－Ａ－Ｅ－Ｋ－Ｉ－Ｓ－Ａ－Ｎ」という音素配列における子音「Ｓ、Ｋ、Ｓ、Ｎ」を照合すると、全４音素のうち、３つの音素が一致しており、３／４＝０．７５が一致率となる。 The text conversion assistance device 100 also calculates the matching rate for consonants based on the phoneme sequence obtained in s2 and s3 (s14). In the above example, when comparing the consonants "T, K, S, N" in the phoneme sequence "T-A-I-K-I-S-A-N" with the consonants "S, K, S, N" in the phoneme sequence "S-A-E-K-I-S-A-N", three of the four phonemes match, resulting in a matching rate of 3/4 = 0.75.
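The consonant match rate of s14 can be sketched in the same way. Following the example in the text, every phoneme that is not one of the five vowels (including the final "N") is treated as a consonant; the function names are hypothetical.

```python
VOWELS = {"A", "I", "U", "E", "O"}


def extract_consonants(phonemes: str) -> list[str]:
    """Keep the non-vowel phonemes of a hyphen-separated phoneme sequence."""
    return [p for p in phonemes.split("-") if p not in VOWELS]


def consonant_match_rate(extracted: str, master: str) -> float:
    """s14: position-wise exact match rate between the consonant sequences."""
    a = extract_consonants(extracted)
    b = extract_consonants(master)
    total = max(len(a), len(b))
    if total == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / total


# ['T', 'K', 'S', 'N'] versus ['S', 'K', 'S', 'N']
print(consonant_match_rate("T-A-I-K-I-S-A-N", "S-A-E-K-I-S-A-N"))  # 0.75
```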

続いて、テキスト化支援装置１００は、ｓ１３で得た母音類似度に重み付けをした上で、子音の一致率と加重平均を行って、音素配列間の一致率を算定する（ｓ１５）。 Next, the text conversion assistance device 100 weights the vowel similarity obtained in s13, and calculates the match rate between the phoneme sequences by taking a weighted average with the consonant match rate (s15).
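The weighted average of s15 can be sketched as below, using the vowel similarity from s13 and the consonant match rate from s14. The function name and the default `vowel_weight=2.0` are illustrative; the description presents a weight of 2 only as an example.

```python
def combined_match_rate(
    consonant_rate: float, vowel_sim: float, vowel_weight: float = 2.0
) -> float:
    """s15: weighted average in which the vowel similarity carries more weight."""
    return (consonant_rate + vowel_sim * vowel_weight) / (1.0 + vowel_weight)


# Consonant match rate 0.75 (s14) and vowel similarity 0.875 (s13).
print(round(combined_match_rate(0.75, 0.875), 2))  # 0.83
```

With a weight of 1 the function reduces to a plain average of the two rates, which makes the effect of prioritizing vowels easy to compare.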

例えば、上述の重み付けを「２」、すなわち子音の一致率より２倍の重みをつけて加重平均を行うとすれば、（子音一致率０．７５＋母音類似度０．８７５×重み２）／３＝０．８３、と一致率を算定できる。 For example, if the weight is set to "2", that is, the vowel similarity is given twice the weight of the consonant match rate in the weighted average, the match rate can be calculated as (consonant match rate 0.75 + vowel similarity 0.875 × weight 2) / 3 ≈ 0.83.
＜フロー例３＞
<Flow Example 3>
図１１は、本実施形態におけるテキスト化支援方法のフロー例３を示す図である。ここでは、上述のフロー例１、２における効果をさらに高めるべく、脱字や衍字への対処という観点を加えて音素配列の一致度を算定する手法について説明する。なお、本フローにおいては、上述のフロー例１におけるｓ１、ｓ２、フロー例におけるｓ１０、ｓ１１までは同様であるため、それ以降の処理として説明を行うものとする。 Figure 11 is a diagram showing flow example 3 of the text conversion assistance method in this embodiment. Here, a method of calculating the degree of match between phoneme sequences that additionally handles omitted characters and superfluous characters is described, in order to further enhance the effects of flow examples 1 and 2 above. In this flow, steps s1 and s2 of flow example 1 and steps s10 and s11 of flow example 2 are the same, so only the subsequent processing is described.

テキスト化支援装置１００は、上述のように抽出した、通話録音データにおける音素配列中の母音配列、及び、音素マスタテーブル１２６の対応レコードから読み出した音素配列中の母音配列のそれぞれに関して、当該母音配列において連続する２つの母音の組みにおける類似度を発話類似度テーブル１２７に基づき特定する（ｓ２０）。 The text conversion assistance device 100 determines the similarity between pairs of two consecutive vowels in the vowel sequence for each of the vowel sequences in the phoneme sequence in the call recording data extracted as described above and the vowel sequence in the phoneme sequence read from the corresponding record in the phoneme master table 126 based on the speech similarity table 127 (s20).

例えば、通話録音データから得た音素配列「Ｏ－Ｈ－Ａ－Ｙ－Ｏ－Ｕ－Ｇ－Ｏ－Ｚ－Ａ－Ｉ－Ｍ－Ａ－Ｓ－Ｕ－Ｓ－Ａ－Ｋ－Ｉ－Ｓ－Ａ－Ｎ」中の母音配列「Ｏ、Ａ、Ｏ、Ｕ、Ｏ、Ａ、Ｉ、Ａ、Ｕ、Ａ、Ｉ、Ａ」では、先頭から２つずつ母音を選択し、組み（１）「Ｏ、Ａ」、組み（２）「Ｏ、Ｕ」、組み（３）「Ｏ、Ａ」、組み（４）「Ｉ、Ａ」、組み（５）「Ｕ、Ａ」、組み（６）「Ｉ、Ａ」といった計６つの組みを形成した場合、発話類似度テーブル１２７に基づき、組み（１）は類似度「０．５」、組み（２）は類似度「０．５」、組み（３）は類似度「０．５」、組み（４）は類似度「０」、組み（５）は類似度「０」、組み（６）は類似度「０」と特定できる。 For example, take the vowel sequence "O, A, O, U, O, A, I, A, U, A, I, A" in the phoneme sequence "O-H-A-Y-O-U-G-O-Z-A-I-M-A-S-U-S-A-K-I-S-A-N" obtained from call recording data. If the vowels are selected two at a time from the beginning to form a total of six pairs, namely pair (1) "O, A", pair (2) "O, U", pair (3) "O, A", pair (4) "I, A", pair (5) "U, A", and pair (6) "I, A", then based on the speech similarity table 127 pairs (1), (2), and (3) each have a similarity of "0.5", and pairs (4), (5), and (6) each have a similarity of "0".

続いて、テキスト化支援装置１００は、ｓ２０で特定した各組みの類似度が例えば０．５といった基準以上の組みについては予め定めた１つの規定母音（例：Ａ、Ｉ、Ｕ）に畳み込み、類似度が基準を下回る組みについては先頭の母音を採用して、後尾の母音を音素配列中において隣接する母音と次なる組みを形成する処理を実行して、音節配列を生成する（ｓ２１）。 Next, the text conversion assistance device 100 folds the pairs identified in s20 whose similarity is equal to or exceeds a criterion, such as 0.5, into a single predefined vowel (e.g., A, I, U), and for pairs whose similarity is below the criterion, it uses the first vowel and executes a process of forming the next pair with the last vowel in the phoneme sequence with an adjacent vowel, thereby generating a syllable sequence (s21).

上述の例の場合、組み（１）は母音「Ａ」に集約（すなわち畳み込み。以下同様）、組み（２）は母音「Ｕ」に集約、組み（３）は母音「Ａ」に集約、組み（４）は先頭の母音「Ｉ」を採用し、後尾の母音「Ａ」を当初の組み（５）の先頭の母音「Ｕ」と組み合わせた新たな組み（５）’を形成し、これ以降の母音の配列についても組みを再構成し、上述の類似度に基づく集約を実行する。 In the above example, pair (1) is aggregated (that is, folded; the same applies below) into the vowel "A", pair (2) into the vowel "U", and pair (3) into the vowel "A". For pair (4), the leading vowel "I" is adopted, and the trailing vowel "A" is combined with the leading vowel "U" of the original pair (5) to form a new pair (5)'. The pairs are likewise re-formed for the rest of the vowel sequence, and the similarity-based aggregation described above is applied.

その結果、各組みの集約を経て残った音節配列は、「Ａ、Ｕ、Ａ、Ｉ、Ａ、Ｕ、Ａ、Ｉ、Ａ」となる。 As a result, the syllable sequence remaining after combining each set is "A, U, A, I, A, U, A, I, A."

テキスト化支援装置１００は、こうした音節配列の生成を、音素マスタテーブル１２６で対応するレコードの音素配列「Ｓ－Ａ－Ｓ－Ａ－Ｋ－Ｉ－Ｓ－Ａ－Ｎ」における母音配列「Ａ、Ａ、Ｉ、Ａ」に関しても同様に実行し、「Ａ、Ｉ、Ａ」を得ることになる。 The text conversion assistance device 100 also performs this type of syllable sequence generation for the vowel sequence "A, A, I, A" in the phoneme sequence "S-A-S-A-K-I-S-A-N" of the corresponding record in the phoneme master table 126, thereby obtaining "A, I, A."
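The folding of s20 and s21 can be sketched as follows. Several points are assumptions rather than statements of the embodiment: the table is a hypothetical excerpt of the speech similarity table 127 with only the values quoted in the text, unlisted pairs default to 0, and a folded pair is represented by its trailing vowel, which is merely consistent with the worked example (the description says only that the pair is folded into "one predetermined defined vowel").

```python
# Hypothetical excerpt of the speech similarity table 127 (values quoted in
# the text); unlisted pairs of different vowels default to 0 (an assumption).
SIMILARITY_TABLE = {
    frozenset(("A", "E")): 0.5,
    frozenset(("A", "O")): 0.5,
    frozenset(("I", "E")): 0.5,
    frozenset(("O", "U")): 0.5,
}


def similarity(x: str, y: str) -> float:
    """Look up the similarity of two vowels; identical vowels score 1."""
    if x == y:
        return 1.0
    return SIMILARITY_TABLE.get(frozenset((x, y)), 0.0)


def fold_vowels(vowels: list[str], threshold: float = 0.5) -> list[str]:
    """s20/s21: fold similar consecutive vowel pairs into a syllable sequence."""
    folded = []
    i = 0
    while i < len(vowels):
        has_next = i + 1 < len(vowels)
        if has_next and similarity(vowels[i], vowels[i + 1]) >= threshold:
            # Assumption: the folded pair is represented by its trailing vowel.
            folded.append(vowels[i + 1])
            i += 2
        else:
            # Keep the leading vowel; the trailing one starts the next pair.
            folded.append(vowels[i])
            i += 1
    return folded


recorded = ["O", "A", "O", "U", "O", "A", "I", "A", "U", "A", "I", "A"]
print(fold_vowels(recorded))              # ['A', 'U', 'A', 'I', 'A', 'U', 'A', 'I', 'A']
print(fold_vowels(["A", "A", "I", "A"]))  # ['A', 'I', 'A']
```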

次に、テキスト化支援装置１００は、ｓ２１において、通話録音データ由来の音節配列中で、音素マスタテーブル１２６由来で生成した音節配列と一致する箇所について、音素マスタテーブル１２６由来の音節配列と母音数を比較し、当該母音数が等しい場合（ｓ２２：同数）、上述の箇所と音素マスタテーブル１２６由来の音節配列とで、対応する母音配列における母音の一致率を発話類似度テーブル１２７に基づき算定する（ｓ２３）。 Next, in s21, the text conversion assistance device 100 compares the number of vowels in the syllable sequence derived from the phoneme master table 126 for the portion of the syllable sequence derived from the call recording data that matches the syllable sequence generated from the phoneme master table 126, and if the number of vowels is the same (s22: same number), it calculates the matching rate of the vowels in the corresponding vowel sequence between the above-mentioned portion and the syllable sequence derived from the phoneme master table 126 based on the speech similarity table 127 (s23).

例えば、通話録音データの音素配列「Ｏ－Ｈ－Ａ－Ｙ－Ｏ－Ｕ－Ｇ－Ｏ－Ｚ－Ａ－Ｉ－Ｍ－Ａ－Ｓ－Ｕ－Ｓ－Ａ－Ｋ－Ｉ－Ｓ－Ａ－Ｎ」中の母音配列「Ｏ、Ａ、Ｏ、Ｕ、Ｏ、Ａ、Ｉ、Ａ、Ｕ、Ａ、Ｉ、Ａ」のうち、その音素配列が音素マスタテーブル１２６由来の音節配列「Ａ、Ｉ、Ａ」（これは母音配列「Ａ、Ａ、Ｉ、Ａ」に基づく）と一致するのは、「Ｏ、Ａ、Ｉ、Ａ」の箇所である。 For example, of the vowel sequence "O, A, O, U, O, A, I, A, U, A, I, A" in the phoneme sequence "O-H-A-Y-O-U-G-O-Z-A-I-M-A-S-U-S-A-K-I-S-A-N" in the recorded call data, the part where the phoneme sequence matches the syllable sequence "A, I, A" derived from the phoneme master table 126 (which is based on the vowel sequence "A, A, I, A") is "O, A, I, A".

よってテキスト化支援装置１００は、通話録音データ由来の母音配列中「Ｏ、Ａ、Ｉ、Ａ」と、音素マスタテーブル１２６由来の母音配列「Ａ、Ａ、Ｉ、Ａ」との間について、各母音の間の類似度を発話類似度テーブル１２７に基づいて特定し、例えば、（０．５＋１＋１＋１）／４＝０．８７５、などと算定する。 Therefore, the text conversion assistance device 100 determines the similarity between each vowel between the vowel sequence "O, A, I, A" derived from the call recording data and the vowel sequence "A, A, I, A" derived from the phoneme master table 126 based on the speech similarity table 127, and calculates it to be, for example, (0.5 + 1 + 1 + 1) / 4 = 0.875.

一方、上述のｓ２２での母音数の比較の結果、前記通話録音データ由来の母音数よりもマスタテーブル１２６由来の母音数が多い場合（ｓ２２：多）、テキスト化支援装置１００は、脱字が行っていると推定し、マスタテーブル１２６由来の音節配列が正とし、通話録音データ由来の音節配列において母音が欠けている部分について、当該マスタテーブル１２６由来の対応音素で補って補正し（ｓ２４）、この補正が行われた母音配列とマスタテーブル１２６由来の母音配列との間で母音の一致率を発話類似度テーブル１２７に基づき算定する（ｓ２５）。 On the other hand, if the comparison of the number of vowels in s22 above shows that the number of vowels derived from the master table 126 is greater than the number of vowels derived from the call recording data (s22: more), the text conversion assistance device 100 assumes that a character has been omitted, determines that the syllable sequence derived from the master table 126 is correct, and corrects the missing vowels in the syllable sequence derived from the call recording data by filling in the corresponding phonemes from the master table 126 (s24), and calculates the vowel matching rate between the corrected vowel sequence and the vowel sequence derived from the master table 126 based on the speech similarity table 127 (s25).

例えば、通話録音データの音素配列「Ｓ－Ａ－Ｋ－Ｉ－Ｓ－Ａ」中の母音配列「Ａ、Ｉ、Ａ」は、その音素配列が音素マスタテーブル１２６由来の音節配列「Ａ、Ｉ、Ａ」（これは母音配列「Ａ、Ａ、Ｉ、Ａ」に基づく）と一致する。ただし、対応する母音配列中の母音数は、マスタテーブル１２６由来の母音配列の方が１つ多い。 For example, the vowel sequence "A, I, A" in the phoneme sequence "S-A-K-I-S-A" of the call recording data matches the syllable sequence "A, I, A" (based on the vowel sequence "A, A, I, A") derived from the phoneme master table 126. However, the number of vowels in the corresponding vowel sequence is one more in the vowel sequence derived from the master table 126.

そこで、テキスト化支援装置１００は、通話録音データ由来の母音配列「Ａ、Ｉ、Ａ」のうち、上述のマスタテーブル１２６由来の母音配列「Ａ、Ａ、Ｉ、Ａ」と比べて不足している、すなわち欠けているものが先頭から２番目「Ａ」である。よって、テキスト化支援装置１００は、通話録音データ由来の母音配列「Ａ、Ｉ、Ａ」のうち、先頭「Ａ」と２番目の「Ｉ」の間に、「Ａ」を補って補正する。 Here, compared with the vowel sequence "A, A, I, A" derived from the master table 126, what is missing from the vowel sequence "A, I, A" derived from the call recording data is the second "A" from the beginning. The text conversion assistance device 100 therefore corrects the vowel sequence "A, I, A" derived from the call recording data by inserting an "A" between the leading "A" and the "I" in second position.

また、テキスト化支援装置１００は、上述の補正を行った母音配列と、マスタテーブル１２６由来の母音配列の間の類似度を、発話類似度テーブル１２７に基づいて（１＋１＋１＋１）／４＝１、などと算定することになる。 The text conversion assistance device 100 also calculates the similarity between the vowel sequence that has been corrected as described above and the vowel sequence derived from the master table 126 based on the speech similarity table 127 as (1+1+1+1)/4=1, etc.
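The omission correction of s24 and s25 can be sketched as below. The description does not state how the missing position is located, so the exhaustive search used here (try each insertion point, keep the best-scoring candidate) is purely an assumption, as are the function names and the hypothetical similarity table excerpt.

```python
# Hypothetical excerpt of the speech similarity table 127; unlisted pairs of
# different vowels default to 0 (an assumption).
SIMILARITY_TABLE = {
    frozenset(("A", "E")): 0.5,
    frozenset(("A", "O")): 0.5,
    frozenset(("I", "E")): 0.5,
}


def similarity(x: str, y: str) -> float:
    if x == y:
        return 1.0
    return SIMILARITY_TABLE.get(frozenset((x, y)), 0.0)


def sequence_similarity(a: list[str], b: list[str]) -> float:
    """Average position-wise similarity of two vowel sequences."""
    total = max(len(a), len(b))
    if total == 0:
        return 0.0
    return sum(similarity(x, y) for x, y in zip(a, b)) / total


def correct_omission(recorded: list[str], master: list[str]) -> list[str]:
    """s24: fill missing vowels from the master sequence, treated as correct."""
    corrected = list(recorded)
    while len(corrected) < len(master):
        best_score, best_candidate = -1.0, corrected
        for pos in range(len(corrected) + 1):
            # Insert the master vowel that occupies this position.
            candidate = corrected[:pos] + [master[pos]] + corrected[pos:]
            score = sequence_similarity(candidate, master)
            if score > best_score:
                best_score, best_candidate = score, candidate
        corrected = best_candidate
    return corrected


master = ["A", "A", "I", "A"]
corrected = correct_omission(["A", "I", "A"], master)
print(corrected)                               # ['A', 'A', 'I', 'A']
print(sequence_similarity(corrected, master))  # 1.0 (s25)
```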

他方、上述のｓ２２での母音数の比較の結果、前記通話録音データ由来の母音数よりもマスタテーブル１２６由来の母音数が少ない場合（ｓ２２：少）、テキスト化支援装置１００は、衍字が行っていると推定し、マスタテーブル１２６由来の音節配列が正とし、通話録音データ由来の音節配列において母音が過剰となっている部分について削除して補正し（ｓ２６）、この補正が行われた母音配列とマスタテーブル１２６由来の母音配列との間で母音の一致率を発話類似度テーブル１２７に基づき算定する（ｓ２７）。 On the other hand, if the comparison of the number of vowels in s22 above shows that the number of vowels derived from the master table 126 is smaller than the number of vowels derived from the call recording data (s22: fewer), the text conversion assistance device 100 assumes that a superfluous character has been inserted, treats the syllable sequence derived from the master table 126 as correct, corrects the syllable sequence derived from the call recording data by deleting the surplus vowels (s26), and calculates the vowel match rate between the corrected vowel sequence and the vowel sequence derived from the master table 126 based on the speech similarity table 127 (s27).

例えば、通話録音データの音素配列「Ａ－Ｋ－Ａ－Ｓ－Ａ－Ｋ－Ｉ－Ｓ－Ａ」中の母音配列「Ａ、Ａ、Ａ、Ｉ、Ａ」は、その音素配列が音素マスタテーブル１２６由来の音節配列「Ａ、Ｉ、Ａ」（これは母音配列「Ａ、Ａ、Ｉ、Ａ」に基づく）と一致する。ただし、対応する母音配列中の母音数は、マスタテーブル１２６由来の母音配列の方が１つ少ない。 For example, the vowel sequence "A, A, A, I, A" in the phoneme sequence "A-K-A-S-A-K-I-S-A" of the call recording data matches the syllable sequence "A, I, A" (based on the vowel sequence "A, A, I, A") derived from the phoneme master table 126. However, the number of vowels in the corresponding vowel sequence is one less in the vowel sequence derived from the master table 126.

そこで、テキスト化支援装置１００は、通話録音データ由来の母音配列「Ａ、Ａ、Ａ、Ｉ、Ａ」のうち、上述のマスタテーブル１２６由来の母音配列「Ａ、Ａ、Ｉ、Ａ」と比べて過剰となっているものが先頭の「Ａ」である。よって、テキスト化支援装置１００は、通話録音データ由来の母音配列「Ａ、Ａ、Ａ、Ｉ、Ａ」のうち、先頭「Ａ」を削除して補正する。 Then, in the vowel sequence "A, A, A, I, A" derived from the call recording data, the first "A" is excessive compared to the vowel sequence "A, A, I, A" derived from the master table 126 described above. Therefore, the text conversion assistance device 100 corrects the vowel sequence "A, A, A, I, A" derived from the call recording data by deleting the first "A".
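The surplus correction of s26 and s27 can be sketched in a similar way. As with the omission case, the description does not say how the surplus vowel is located; the exhaustive search over deletion positions, the function names, and the hypothetical similarity table excerpt are all assumptions.

```python
# Hypothetical excerpt of the speech similarity table 127; unlisted pairs of
# different vowels default to 0 (an assumption).
SIMILARITY_TABLE = {
    frozenset(("A", "E")): 0.5,
    frozenset(("A", "O")): 0.5,
    frozenset(("I", "E")): 0.5,
}


def similarity(x: str, y: str) -> float:
    if x == y:
        return 1.0
    return SIMILARITY_TABLE.get(frozenset((x, y)), 0.0)


def sequence_similarity(a: list[str], b: list[str]) -> float:
    """Average position-wise similarity of two vowel sequences."""
    total = max(len(a), len(b))
    if total == 0:
        return 0.0
    return sum(similarity(x, y) for x, y in zip(a, b)) / total


def correct_surplus(recorded: list[str], master: list[str]) -> list[str]:
    """s26: delete surplus vowels, treating the master sequence as correct."""
    corrected = list(recorded)
    while len(corrected) > len(master):
        best_score, best_candidate = -1.0, corrected[1:]
        for pos in range(len(corrected)):
            # Drop the vowel at this position and score the remainder.
            candidate = corrected[:pos] + corrected[pos + 1:]
            score = sequence_similarity(candidate, master)
            if score > best_score:
                best_score, best_candidate = score, candidate
        corrected = best_candidate
    return corrected


master = ["A", "A", "I", "A"]
corrected = correct_surplus(["A", "A", "A", "I", "A"], master)
print(corrected)                               # ['A', 'A', 'I', 'A']
print(sequence_similarity(corrected, master))  # 1.0 (s27)
```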

なお、既にフロー例２で説明しているため、こうした母音配列の類似度にあわせて、子音配列の一致度も考慮して一致率を算定する概念についての説明は省略する。 Note that, as this has already been explained in Flow Example 2, the concept of calculating the match rate by taking into account the degree of agreement of consonant sequences in addition to the degree of similarity of vowel sequences will not be explained here.

以上、本発明を実施するための最良の形態などについて具体的に説明したが、本発明はこれに限定されるものではなく、その要旨を逸脱しない範囲で種々変更可能である。 The above describes in detail the best mode for carrying out the present invention, but the present invention is not limited to this, and various modifications are possible without departing from the spirit of the present invention.

こうした本実施形態によれば、通話内容に関する音声テキスト化の特性に関わらず、当該通話内容のキーワードマッチングを精度良好に実施可能となる。 According to this embodiment, it is possible to perform keyword matching of the contents of a call with high accuracy, regardless of the characteristics of the speech-to-text conversion of the contents of the call.

本明細書の記載により、少なくとも次のことが明らかにされる。すなわち、本実施形態のテキスト化支援装置において、前記記憶装置は、母音間の発話類似度を規定した情報をさらに保持し、前記演算装置は、前記一致率の算定に際し、前記正しい音素及び前記抽出した音素のそれぞれに含まれる母音間の一致率を、前記発話類似度の情報に基づいて算定する処理と、前記正しい音素及び前記抽出した音素のそれぞれに含まれる子音間の一致率を算定する処理と、前記母音間の一致率を前記子音間の一致率よりも優位に重み付けて、前記母音間及び前記子音間の各一致率に基づき前記音素同士の一致率を算定するものである、としてもよい。 The description in this specification makes at least the following clear. That is, in the text conversion assistance device of this embodiment, the storage device may further hold information that specifies the speech similarity between vowels, and the arithmetic device, when calculating the matching rate, may perform a process of calculating the matching rate between the vowels included in each of the correct phoneme and the extracted phoneme based on the speech similarity information, a process of calculating the matching rate between the consonants included in each of the correct phoneme and the extracted phoneme, and a process of weighting the matching rate between the vowels more heavily than the matching rate between the consonants, and calculating the matching rate between the phonemes based on the matching rates between the vowels and between the consonants.

これによれば、上述の音素同士のマッチングに際して、マッチング対象の要素として（種類が少なく区別がしやすい、すなわち誤検知しにくい特性のある）母音を優先することとなり、一致率の精度を良好なものとしやすくなる。ひいては、通話内容に関する音声テキスト化の特性に関わらず、当該通話内容のキーワードマッチングをより精度良好に実施可能となる。 This allows priority to be given to vowels (which are few in number and easy to distinguish, i.e., less prone to false positives) as elements to be matched when matching between the above-mentioned phonemes, which makes it easier to improve the accuracy of the match rate. Ultimately, keyword matching of the contents of the phone call can be performed with higher accuracy, regardless of the characteristics of the speech-to-text conversion of the contents of the phone call.

また、本実施形態のテキスト化支援装置において、前記演算装置は、前記一致率の算定に際し、前記抽出した音素及び前記正しい音素のそれぞれに関して、当該音素に含まれる母音の配列において、連続する２つの母音の組みにおける類似度を前記発話類似度で特定し、前記類似度が基準以上の組みについては予め定めた１つの規定母音に畳み込み、前記類似度が基準を下回る組みについては先頭の母音を採用して、後尾の母音を前記配列において隣接する母音と次なる組みを形成する処理を実行して、音節配列を生成し、前記抽出した音素及び前記正しい音素のそれぞれに関して生成した、前記音節配列の間で母音数を比較し、当該母音数が等しい場合、当該音節配列の元となった、前記抽出した音素及び前記正しい音素のそれぞれの前記母音の配列の間で母音の一致率を前記発話類似度に基づき算定するものである、としてもよい。 In addition, in the text conversion assistance device of this embodiment, when calculating the matching rate, the calculation device may specify, for each of the extracted phoneme and the correct phoneme, the similarity between a pair of two consecutive vowels in the sequence of vowels contained in the phoneme using the speech similarity, convolve pairs with a predetermined specified vowel for pairs with the similarity equal to or higher than a standard, and use the first vowel for pairs with the similarity below the standard to form a next pair with the last vowel in the sequence with an adjacent vowel in the sequence, to generate a syllable sequence, compare the number of vowels between the syllable sequences generated for each of the extracted phoneme and the correct phoneme, and if the number of vowels is equal, calculate the vowel matching rate between the vowel sequences of the extracted phoneme and the correct phoneme that were the basis of the syllable sequence based on the speech similarity.

これによれば、日本語では母音類似度が高い母音が連続する場合、二文字を１音節として発音するケースや、一文字しか発音しないケース、或いは一文字目を発音しないケース、同じ文字を不必要に重ねて発音するケースといった、脱字や衍字などの現象が生じ易いといった問題にも適切に対処することが可能となり、ひいては、通話内容に関する音声テキスト化の特性に関わらず、当該通話内容のキーワードマッチングをより精度良好に実施可能となる。 This makes it possible to deal appropriately with the problem that, in Japanese, when vowels with high mutual similarity occur consecutively, phenomena such as omitted and superfluous characters tend to arise: two characters may be pronounced as one syllable, only one character may be pronounced, the first character may not be pronounced, or the same character may be repeated unnecessarily. As a result, keyword matching of the call content can be performed with greater accuracy, regardless of the characteristics of the speech-to-text conversion of that content.

また、本実施形態のテキスト化支援装置において、前記演算装置は、前記音節配列の間で母音数を比較し、前記抽出した音素よりも前記正しい音素に対応する音節配列での母音数が多い場合、前記正しい音素の音節配列が正とした場合、前記抽出した音素の音節配列において母音が欠けている部分について、当該正しい音素の対応音素で補って補正し、前記補正が行われた前記抽出した音素及び前記正しい音素のそれぞれの前記母音の配列の間で母音の一致率を前記発話類似度に基づき算定するものである、としてもよい。 In addition, in the text conversion assistance device of this embodiment, the calculation device may compare the number of vowels between the syllable sequences, and if the number of vowels in the syllable sequence corresponding to the correct phoneme is greater than that of the extracted phoneme, determine that the syllable sequence of the correct phoneme is correct, correct the missing vowels in the syllable sequence of the extracted phoneme by filling in the corresponding phoneme of the correct phoneme, and calculate the vowel matching rate between the vowel sequences of the extracted phoneme and the correct phoneme after the correction based on the speech similarity.

これによれば、上述の脱字の事象に対して適切に対処し、通話内容に関する音声テキスト化の特性に関わらず、当該通話内容のキーワードマッチングをより精度良好に実施可能となる。 This allows the above-mentioned character omissions to be dealt with appropriately, and makes it possible to perform keyword matching of the call content with greater accuracy, regardless of the characteristics of the speech-to-text conversion of that content.

また、本実施形態のテキスト化支援装置において、前記演算装置は、前記音節配列の間で母音数を比較し、前記抽出した音素よりも前記正しい音素に対応する音節配列での母音数が少ない場合、前記正しい音素の音節配列が正とした場合、前記抽出した音素の音節配列において母音が余剰となっている部分を削除して補正し、前記補正が行われた前記抽出した音素及び前記正しい音素のそれぞれの前記母音の配列の間で母音の一致率を前記発話類似度に基づき算定するものである、としてもよい。 In addition, in the text conversion assistance device of this embodiment, the calculation device may compare the number of vowels between the syllable sequences, and if the number of vowels in the syllable sequence corresponding to the correct phoneme is smaller than that of the extracted phoneme, determine that the syllable sequence of the correct phoneme is correct, delete the portion in which there are excess vowels in the syllable sequence of the extracted phoneme to correct it, and calculate the vowel matching rate between the vowel sequences of the corrected extracted phoneme and the correct phoneme based on the speech similarity.

これによれば、上述の衍字の事象に対して適切に対処し、通話内容に関する音声テキスト化の特性に関わらず、当該通話内容のキーワードマッチングをより精度良好に実施可能となる。 This allows the above-mentioned superfluous characters to be dealt with appropriately, and makes it possible to perform keyword matching of the call content with greater accuracy, regardless of the characteristics of the speech-to-text conversion of that content.

また、本実施形態のテキスト化支援方法において、前記情報処理装置が、前記記憶装置において、母音間の発話類似度を規定した情報をさらに保持し、前記一致率の算定に際し、前記正しい音素及び前記抽出した音素のそれぞれに含まれる母音間の一致率を、前記発話類似度の情報に基づいて算定する処理と、前記正しい音素及び前記抽出した音素のそれぞれに含まれる子音間の一致率を算定する処理と、前記母音間の一致率を前記子音間の一致率よりも優位に重み付けて、前記母音間及び前記子音間の各一致率に基づき前記音素同士の一致率を算定する、としてもよい。 In addition, in the text conversion assistance method of this embodiment, the information processing device may further store information in the storage device that specifies the speech similarity between vowels, and when calculating the matching rate, the information processing device may perform a process of calculating the matching rate between the vowels contained in each of the correct phoneme and the extracted phoneme based on the speech similarity information, a process of calculating the matching rate between the consonants contained in each of the correct phoneme and the extracted phoneme, and a process of weighting the matching rate between the vowels more heavily than the matching rate between the consonants to calculate the matching rate between the phonemes based on the matching rates between the vowels and between the consonants.

また、本実施形態のテキスト化支援方法において、前記情報処理装置が、前記一致率の算定に際し、前記抽出した音素及び前記正しい音素のそれぞれに関して、当該音素に含まれる母音の配列において、連続する２つの母音の組みにおける類似度を前記発話類似度で特定し、前記類似度が基準以上の組みについては予め定めた１つの規定母音に畳み込み、前記類似度が基準を下回る組みについては先頭の母音を採用して、後尾の母音を前記配列において隣接する母音と次なる組みを形成する処理を実行して、音節配列を生成し、前記抽出した音素及び前記正しい音素のそれぞれに関して生成した、前記音節配列の間で母音数を比較し、当該母音数が等しい場合、当該音節配列の元となった、前記抽出した音素及び前記正しい音素のそれぞれの前記母音の配列の間で母音の一致率を前記発話類似度に基づき算定する、としてもよい。 In addition, in the text conversion assistance method of this embodiment, when calculating the matching rate, the information processing device may specify, for each of the extracted phoneme and the correct phoneme, the similarity between a pair of two consecutive vowels in the sequence of vowels contained in the phoneme using the speech similarity, and for pairs whose similarity is equal to or greater than a standard, convolve them into one predetermined specified vowel, and for pairs whose similarity is below the standard, use the first vowel and execute a process of forming the next pair of the last vowel with the adjacent vowel in the sequence to generate a syllable sequence, compare the number of vowels between the syllable sequences generated for each of the extracted phoneme and the correct phoneme, and if the number of vowels is equal, calculate the vowel matching rate between the vowel sequences of the extracted phoneme and the correct phoneme that were the basis of the syllable sequence based on the speech similarity.

また、本実施形態のテキスト化支援方法において、前記情報処理装置が、前記音節配列の間で母音数を比較し、前記抽出した音素よりも前記正しい音素に対応する音節配列での母音数が多い場合、前記正しい音素の音節配列が正とした場合、前記抽出した音素の音節配列において母音が欠けている部分について、当該正しい音素の対応音素で補って補正し、前記補正が行われた前記抽出した音素及び前記正しい音素のそれぞれの前記母音の配列の間で母音の一致率を前記発話類似度に基づき算定する、としてもよい。 In addition, in the text conversion assistance method of this embodiment, the information processing device may compare the number of vowels between the syllable sequences, and if the number of vowels in the syllable sequence corresponding to the correct phoneme is greater than that of the extracted phoneme, determine that the syllable sequence of the correct phoneme is correct, correct the missing vowel in the syllable sequence of the extracted phoneme by filling in the corresponding phoneme of the correct phoneme, and calculate the vowel matching rate between the vowel sequences of the extracted phoneme and the correct phoneme after the correction based on the speech similarity.

また、本実施形態のテキスト化支援方法において、前記情報処理装置が、前記音節配列の間で母音数を比較し、前記抽出した音素よりも前記正しい音素に対応する音節配列での母音数が少ない場合、前記正しい音素の音節配列が正とした場合、前記抽出した音素の音節配列において母音が余剰となっている部分を削除して補正し、前記補正が行われた前記抽出した音素及び前記正しい音素のそれぞれの前記母音の配列の間で母音の一致率を前記発話類似度に基づき算定する、としてもよい。 In addition, in the text conversion assistance method of this embodiment, the information processing device may compare the number of vowels between the syllable sequences, and if the number of vowels in the syllable sequence corresponding to the correct phoneme is smaller than that of the extracted phoneme, determine that the syllable sequence of the correct phoneme is correct, delete the portion in which there are excess vowels in the syllable sequence of the extracted phoneme to correct it, and calculate the vowel matching rate between the vowel sequences of the corrected extracted phoneme and the correct phoneme based on the speech similarity.

１ネットワーク
１００テキスト化支援装置
１０１記憶装置
１０２プログラム
１０３メモリ
１０４演算装置
１０５通信装置
１１０音響モデル
１１１言語モデル
１２５通話録音ＤＢ
１２６音素マスタテーブル
１２７発話類似度テーブル
２００オペレータ端末
３００コールセンタシステム
４００管理者端末
1 Network
100 Text conversion assistance device
101 Storage device
102 Program
103 Memory
104 Arithmetic device
105 Communication device
110 Acoustic model
111 Language model
125 Call recording DB
126 Phoneme master table
127 Speech similarity table
200 Operator terminal
300 Call center system
400 Administrator terminal

Claims

A storage device that holds master data defining correct phoneme information for each vocabulary expected to appear in each conversation scene or subject , and information defining speech similarity between vowels ;
a calculation device for executing a process of applying the call recording data obtained from a specified device to an acoustic model to extract phonemes; a process of calculating a matching rate between the correct phonemes of vocabulary, of which phonemes are defined in the master data, that are expected to appear in the conversation scene or subject of the call recording data and the extracted phonemes, the process of calculating a matching rate between the vowels included in the correct phoneme and the extracted phoneme based on information on the speech similarity, calculating a matching rate between the consonants included in the correct phoneme and the extracted phoneme, and calculating a matching rate between the phonemes based on the matching rates between the vowels and the consonants by weighting the matching rate between the vowels more heavily than the matching rate between the consonants; and a process of specifying the vocabulary whose phonemes show a predetermined matching rate as a keyword matching result.
A text conversion assistance device comprising:

The computing device includes:
When calculating the match rate,
For each of the extracted phonemes and the correct phonemes, a similarity between a pair of two consecutive vowels in a sequence of vowels included in the phoneme is specified by the speech similarity, and pairs having the similarity equal to or greater than a standard are folded into one predetermined defined vowel, and pairs having the similarity below the standard are folded into a next pair by adopting the first vowel and forming a next pair of the last vowel with an adjacent vowel in the sequence, thereby generating a syllable sequence;
a number of vowels is compared between the syllable sequences generated for the extracted phoneme and the correct phoneme, and if the number of vowels is equal, a vowel matching rate is calculated between the vowel sequences of the extracted phoneme and the correct phoneme, which are the basis of the syllable sequence, based on the speech similarity.
2. The text conversion assistance device according to claim 1 .

The computing device includes:
comparing the number of vowels between the syllable sequences, and if the number of vowels in the syllable sequence corresponding to the correct phoneme is greater than that of the extracted phoneme, determining that the syllable sequence of the correct phoneme is correct, correcting a portion where a vowel is missing in the syllable sequence of the extracted phoneme by filling in the corresponding phoneme of the correct phoneme;
A vowel matching rate between the vowel sequences of the extracted phonemes after the correction and the correct phonemes is calculated based on the speech similarity.
3. The text conversion assistance device according to claim 2 .

The computing device:
compares the number of vowels between the syllable sequences and, if the syllable sequence corresponding to the correct phoneme has fewer vowels than that of the extracted phoneme, treats the syllable sequence of the correct phoneme as correct and corrects the syllable sequence of the extracted phoneme by deleting the portion containing excess vowels; and
calculates, based on the speech similarity, a vowel matching rate between the vowel sequences of the corrected extracted phoneme and the correct phoneme.
4. The text conversion assistance device according to claim 2.
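The deletion of excess vowels described above can be sketched in the same spirit. How the excess portion is identified is not specified in the claim; the greedy scan below, which drops an extracted vowel when it does not equal the next expected correct vowel and there is still surplus to remove, is a hypothetical choice, as is the name `drop_excess_vowels`.

```python
def drop_excess_vowels(extracted, correct):
    """Delete surplus vowels from `extracted` so its length matches `correct`.

    Assumes len(extracted) >= len(correct). The correct sequence is
    treated as the reference; surplus vowels are located greedily.
    """
    if len(extracted) < len(correct):
        raise ValueError("extracted must not be shorter than correct")
    corrected = []
    j = 0  # next position in `correct` still to be covered
    for i, vowel in enumerate(extracted):
        remaining_extracted = len(extracted) - i
        remaining_correct = len(correct) - j
        if remaining_correct == 0:
            break  # everything left in `extracted` is surplus
        if remaining_extracted == remaining_correct or vowel == correct[j]:
            # No surplus left to drop, or the vowel lines up: keep it.
            corrected.append(vowel)
            j += 1
        # Otherwise the vowel is treated as excess and dropped.
    return corrected
```

The result always has as many vowels as the correct sequence, which makes the subsequent matching-rate calculation a simple positional comparison.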

An information processing device,
comprising a storage device that stores master data specifying correct phoneme information for each vocabulary item expected to appear in each conversation scene or subject, and information specifying speech similarity between vowels, executes:
a process of applying call recording data obtained from a specified device to an acoustic model to extract phonemes;
a process of calculating a matching rate between the correct phonemes of vocabulary whose phonemes are defined in the master data and the extracted phonemes, in which a matching rate between the vowels contained in the correct phonemes and the extracted phonemes is calculated based on the information on the speech similarity, a matching rate between the consonants contained in the correct phonemes and the extracted phonemes is calculated, and the matching rate between the phonemes is calculated from the matching rates of the vowels and the consonants by weighting the matching rate between the vowels more heavily than the matching rate between the consonants; and
a process of identifying, as a keyword matching result, the vocabulary whose phonemes show a predetermined matching rate as a result of the calculation.
A text conversion assistance method in which the information processing device performs the above.
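As a rough illustration of the weighted matching described above, the sketch below scores already-extracted phonemes against master data. The acoustic-model step is out of scope here; the master entries, the similarity table, the vowel weight of 0.7, the threshold of 0.8, and the positional scoring are all hypothetical values and choices made for the example.

```python
VOWELS = set("aiueo")

# Hypothetical speech similarity between distinct vowels (unlisted pairs: 0.0).
VOWEL_SIMILARITY = {frozenset("ei"): 0.8, frozenset("ou"): 0.8}

# Hypothetical master data: vocabulary item -> correct phoneme sequence.
MASTER = {
    "keiyaku": "k e i y a k u".split(),
    "kaiyaku": "k a i y a k u".split(),
}


def vowel_similarity(a, b):
    return 1.0 if a == b else VOWEL_SIMILARITY.get(frozenset((a, b)), 0.0)


def exact_match(a, b):
    return 1.0 if a == b else 0.0


def positional_rate(xs, ys, score):
    """Average positional score; unmatched positions count as 0."""
    n = max(len(xs), len(ys))
    if n == 0:
        return 1.0
    return sum(score(x, y) for x, y in zip(xs, ys)) / n


def phoneme_match_rate(extracted, correct, vowel_weight=0.7):
    """Combine vowel and consonant rates, weighting vowels more heavily."""
    ev = [p for p in extracted if p in VOWELS]
    cv = [p for p in correct if p in VOWELS]
    ec = [p for p in extracted if p not in VOWELS]
    cc = [p for p in correct if p not in VOWELS]
    vowel_rate = positional_rate(ev, cv, vowel_similarity)
    consonant_rate = positional_rate(ec, cc, exact_match)
    return vowel_weight * vowel_rate + (1 - vowel_weight) * consonant_rate


def match_keywords(extracted, master=MASTER, threshold=0.8):
    """Return vocabulary items whose matching rate reaches the threshold."""
    return [
        word
        for word, correct in master.items()
        if phoneme_match_rate(extracted, correct) >= threshold
    ]
```

With these assumed values, an extracted sequence such as `k e e y a g u` still matches "keiyaku" because the vowels agree closely, while "kaiyaku" falls below the threshold. Weighting vowels more heavily makes the score tolerant of consonant misrecognition.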

The information processing device, when calculating the matching rate:
for each of the extracted phonemes and the correct phonemes, specifies, based on the speech similarity, a similarity between each pair of two consecutive vowels in the sequence of vowels included in the phoneme, folds a pair whose similarity is equal to or greater than a standard into one predetermined vowel, and, for a pair whose similarity is below the standard, adopts the first vowel as it is and forms the next pair from the last vowel and the adjacent vowel in the sequence, thereby generating a syllable sequence; and
compares the number of vowels between the syllable sequences generated for the extracted phoneme and the correct phoneme, respectively, and, if the numbers of vowels are equal, calculates, based on the speech similarity, a vowel matching rate between the vowel sequences of the extracted phoneme and the correct phoneme on which the syllable sequences are based.
The text conversion assistance method according to claim 5.

The information processing device:
compares the number of vowels between the syllable sequences and, if the syllable sequence corresponding to the correct phoneme has more vowels than that of the extracted phoneme, treats the syllable sequence of the correct phoneme as correct and corrects the syllable sequence of the extracted phoneme by filling each position where a vowel is missing with the corresponding phoneme of the correct phoneme; and
calculates, based on the speech similarity, a vowel matching rate between the vowel sequences of the corrected extracted phoneme and the correct phoneme.
The text conversion assistance method according to claim 6.

The information processing device:
compares the number of vowels between the syllable sequences and, if the syllable sequence corresponding to the correct phoneme has fewer vowels than that of the extracted phoneme, treats the syllable sequence of the correct phoneme as correct and corrects the syllable sequence of the extracted phoneme by deleting the portion containing excess vowels; and
calculates, based on the speech similarity, a vowel matching rate between the vowel sequences of the corrected extracted phoneme and the correct phoneme.
The text conversion assistance method according to claim 6.