JP7541474B2

JP7541474B2 - Speech evaluation system

Info

Publication number: JP7541474B2
Application number: JP2020206677A
Authority: JP
Inventors: 謙吾竹谷; 憲卓岡本; 保静松岡; 聡一朗村上; 熱気澤山
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2020-12-14
Filing date: 2020-12-14
Publication date: 2024-08-28
Anticipated expiration: 2040-12-14
Also published as: JP2022093939A

Description

本発明は、発話評価システムに関する。 The present invention relates to a speech evaluation system.

文献１には、人間によって話者の発話が聞き起こされた結果と、所定の音声認識部によって話者の発話が音声認識された結果とを比較することによって、音声認識された結果が正しいか否かを判定する音声認識結果評価装置が開示されている。このような装置によれば、音声認識された結果における認識の誤りを検出することができる。 Literature 1 discloses a speech recognition result evaluation device that determines whether the speech recognition result is correct by comparing the result of a speaker's speech being transcribed by a human with the result of speech recognition of the speaker's speech by a specified speech recognition unit. Such a device makes it possible to detect recognition errors in the speech recognition result.

Literature 1: JP 2001-67096 A (Japanese Unexamined Patent Application Publication No. 2001-67096)

ここで、発話の音声認識に関して、認識した発話の文意を正確に捉えるため、発話における発音の誤りを検出したいというユーザの要望がある。しかしながら、上述した従来の装置では、音声認識の対象である話者の発話そのものに発音の誤りがあった場合に、発音の誤りを適切に検出することはできない。例えば、発話が音声認識された結果から、発話における誤りが検出される場合、通常、話者による単語の誤用と、話者による発音の誤りとが区別されず、いずれも発話における誤りとして検出されてしまう。 Here, with regard to speech recognition of speech, there is a desire from users to detect pronunciation errors in speech in order to accurately grasp the meaning of the recognized utterance. However, the conventional devices described above are unable to properly detect pronunciation errors when there is a pronunciation error in the speech itself of the speaker who is the target of speech recognition. For example, when an error in speech is detected from the results of speech recognition of an utterance, typically there is no distinction between the speaker's misuse of words and the speaker's pronunciation error, and both are detected as errors in speech.

本発明は、上記実情に鑑みてなされたものであり、話者の発話における発音の誤りを適切に検出することができる発話評価システムを提供することを目的とする。 The present invention has been made in consideration of the above-mentioned circumstances, and aims to provide a speech evaluation system that can appropriately detect pronunciation errors in a speaker's speech.

本発明の一態様に係る発話評価システムは、話者の発話を音声認識した結果を取得し、該音声認識した結果を、発音を表す文字列に変換する変換部と、話者の発話に出現すると想定される単語の発音を表す文字列である想定文字列と、変換部によって変換された発音を表す文字列に含まれる一又は複数の発話文字列との編集距離を算出する算出部と、一又は複数の発話文字列のうち、編集距離が所定値以下であり、且つ、想定文字列と同一の文字列ではない発話文字列を、発音誤り文字列として検出する検出部と、発音誤り文字列を出力する出力部と、を備え、算出部は、想定文字列及び発話文字列に含まれる子音の発音分類を考慮して、編集距離を算出する。 The speech evaluation system according to one aspect of the present invention includes a conversion unit that acquires the result of speech recognition of a speaker's speech and converts the result of speech recognition into a character string representing a pronunciation; a calculation unit that calculates an edit distance between an expected character string, which is a character string representing the pronunciation of a word expected to appear in the speaker's speech, and one or more spoken character strings included in the character string representing the pronunciation converted by the conversion unit; a detection unit that detects, among the one or more spoken character strings, a spoken character string whose edit distance is equal to or less than a predetermined value and is not identical to the expected character string as a mispronunciation character string; and an output unit that outputs the mispronunciation character string. The calculation unit calculates the edit distance taking into account the pronunciation classification of consonants included in the expected character string and the spoken character string.

本発明の一態様に係る発話評価システムでは、発音を表す文字列に話者の発話が変換され、話者の発話に出現すると想定される単語の発音を表す文字列である想定文字列と、変換された発音を表す文字列に含まれる一又は複数の発話文字列との編集距離が算出され、一又は複数の発話文字列のうち、編集距離が所定値以下であり、且つ、想定文字列と同一の文字列ではない発話文字列が、発音誤り文字列として検出される。ここで、発話における誤り検出において、例えば、発話を音声認識した結果が文章に変換され、発話における誤りが検出される場合には、通常、話者による単語の誤用と、話者による発音誤りとが区別されず、いずれも発話における誤りとして検出されてしまう。このような検出処理では、話者の発話における発音誤りを適切に検出することができない。この点、本発明の一態様に係る発話評価システムでは、話者の発話が、発音を表す文字列に変換され、該文字列に含まれる発話文字列と、想定文字列との編集距離が導出されて、該編集距離が所定値以下である発話文字列が発話誤り文字列として検出されるため、「誤り度合いが小さく、単語の誤用というよりも単なる発音の誤りである可能性が高い」と推定される発話文字列について、適切に発音誤り文字列として検出することができる。さらに、本発明の一態様に係る発話評価システムでは、子音の発音分類が考慮されて、想定文字列と発話文字列との編集距離が算出される。このような構成によれば、例えば、想定文字列と発音が類似する発音文字列ほど想定文字列との編集距離が小さくなるように、編集距離が算出される。これにより、検出部は、想定文字列に発音が近い発話文字列（「単語の誤用というよりも単なる発音の誤りである可能性が高い」文字列）を発音誤り文字列として検出することができる。以上のように、本発明の一態様に係る発話評価システムによれば、話者の発話における発音の誤りを適切に検出することができる。 In a speech evaluation system according to one aspect of the present invention, a speaker's speech is converted into a character string representing a pronunciation, an edit distance between an expected character string, which is a character string representing the pronunciation of a word expected to appear in the speaker's speech, and one or more speech character strings included in the converted character string representing the pronunciation is calculated, and among the one or more speech character strings, a speech character string whose edit distance is equal to or less than a predetermined value and which is not identical to the expected character string is detected as a mispronunciation character string. Here, in detecting errors in speech, for example, when the result of speech recognition is converted into a sentence and an error in speech is detected, a speaker's misuse of a word and a speaker's mispronunciation are not usually distinguished from each other, and both are detected as errors in speech. In such a detection process, it is not possible to appropriately detect mispronunciations in the speaker's speech. 
In this regard, in the speech evaluation system according to one aspect of the present invention, the speaker's speech is converted into a character string representing its pronunciation, the edit distance between each spoken character string included in that character string and the expected character string is derived, and a spoken character string whose edit distance is equal to or less than a predetermined value is detected as a mispronunciation character string. A spoken character string that is estimated to have a small degree of error, and is therefore more likely to be a mere pronunciation error than a misuse of a word, can thus be appropriately detected as a mispronunciation character string. Furthermore, in the speech evaluation system according to one aspect of the present invention, the edit distance between the expected character string and the spoken character string is calculated taking into account the pronunciation classification of consonants. With this configuration, for example, the edit distance is calculated so that the more similar the pronunciation of a spoken character string is to that of the expected character string, the smaller its edit distance from the expected character string. As a result, the detection unit can detect a spoken character string whose pronunciation is close to that of the expected character string (a character string that is "more likely to be a mere pronunciation error than a misuse of a word") as a mispronunciation character string. As described above, the speech evaluation system according to one aspect of the present invention can appropriately detect pronunciation errors in the speaker's speech.

本発明によれば、話者の発話における発音誤りを適切に検出することができる発話評価システムを提供することができる。 The present invention provides a speech evaluation system that can appropriately detect pronunciation errors in a speaker's speech.

FIG. 1 is a diagram illustrating an overview of the speech evaluation system according to the present embodiment.
FIG. 2 is a block diagram showing the functional configuration of the speech evaluation system according to the present embodiment.
FIG. 3 is a diagram showing an example of the pronunciation classification of consonants taken into consideration in the calculation of the edit distance in FIG. 1.
FIG. 4 is a diagram showing an example of the calculation of the edit distance in FIG. 1 taking into account the pronunciation classification of consonants in FIG. 3.
FIG. 5 is a flowchart showing the processing performed by the speech evaluation system according to the present embodiment.
FIG. 6 is a diagram comparing the conventional technique and the technique according to the present embodiment with respect to the calculation of the edit distance in FIG. 1.
FIG. 7 is a diagram illustrating speech evaluation according to the conventional technique and the technique of the present embodiment.
FIG. 8 is a diagram showing the hardware configuration of the communication terminal and the speech evaluation server included in the speech evaluation system according to the present embodiment.

以下、添付図面を参照しながら本発明の実施形態を詳細に説明する。図面の説明において、同一又は同等の要素には同一符号を用い、重複する説明を省略する。 Embodiments of the present invention will now be described in detail with reference to the accompanying drawings. In the description of the drawings, the same or equivalent elements will be designated by the same reference numerals, and duplicate descriptions will be omitted.

図１及び図２に示される発話評価システム１は、話者Ｓの発話を音声認識し、音声認識結果を評価するシステムである。音声認識結果を評価するとは、例えば、音声認識結果である発話内容の文字列が適切であるか否かを評価することをいう。一例としては、発話評価システム１は、語学検定等に用いられるシステムであり、質問に対する話者Ｓの発話内容の文字列（音声認識結果）が、当該質問に対する回答を示す文字列として適切であるか否かを評価する。 The speech evaluation system 1 shown in Figures 1 and 2 is a system that recognizes the speech of a speaker S and evaluates the speech recognition result. Evaluating the speech recognition result means, for example, evaluating whether or not a character string of the speech content that is the speech recognition result is appropriate. As an example, the speech evaluation system 1 is a system used for language tests, etc., and evaluates whether or not a character string of the speech content of a speaker S in response to a question (speech recognition result) is appropriate as a character string indicating an answer to the question.

発話評価システム１は、より詳細には、音声認識結果において発音誤りがあるか否かを判定し、発音誤りがある場合には、該発音誤りを訂正し、訂正後の音声認識結果について評価を行う。発話評価システム１は、音声認識結果について、発音を表す文字列に変換し、子音の発音類似性を考慮して音声認識結果から発音の誤りを検出し、訂正する。発話評価システム１は、例えば、質問に対する話者Ｓの発話に出現すると想定される単語の発音を表す文字列と、音声認識結果に係る発音を表す文字列との子音の発音類似度を導出し、両文字列が同一ではなく且つ両文字列に含まれる子音の発音類似度が所定値よりも高い場合には、話者Ｓの発話に発音の誤りが含まれていると判定（発音の誤りを検出）し、発音の誤りを訂正し、訂正後の音声認識結果について評価を行う。発話評価システム１は、話者Ｓの発話の集音を行う機能を有する通信端末１０と、データの送受信及びデータ処理を行う発話評価サーバ３０と、を備えている。最初に、発話評価システム１が行う処理の概要について説明する。 More specifically, the speech evaluation system 1 determines whether or not there is a pronunciation error in the speech recognition result, and if there is a pronunciation error, corrects the pronunciation error and evaluates the speech recognition result after the correction. The speech evaluation system 1 converts the speech recognition result into a character string representing the pronunciation, detects and corrects the pronunciation error from the speech recognition result taking into account the pronunciation similarity of the consonants. For example, the speech evaluation system 1 derives the pronunciation similarity of consonants between a character string representing the pronunciation of a word expected to appear in the speech of the speaker S in response to a question and a character string representing the pronunciation related to the speech recognition result, and if the two character strings are not identical and the pronunciation similarity of the consonants contained in both character strings is higher than a predetermined value, it determines that the speech of the speaker S contains a pronunciation error (detects the pronunciation error), corrects the pronunciation error, and evaluates the speech recognition result after the correction. The speech evaluation system 1 includes a communication terminal 10 having a function of collecting the speech of the speaker S, and a speech evaluation server 30 that transmits and receives data and processes data. 
First, we will provide an overview of the processing performed by the speech evaluation system 1.

発話評価システム１では、発話評価サーバ３０が、話者Ｓの発話に出現すると想定される単語の発音を表す文字列（以下、想定文字列と表記する。）、及び、子音の発音分類（後述する）を予め記憶している。図１に示される例では、想定文字列が「ｇｉｔａ－」とされている。 In the speech evaluation system 1, the speech evaluation server 30 pre-stores character strings (hereinafter referred to as expected character strings) representing the pronunciation of words expected to appear in the speech of speaker S, and pronunciation classifications of consonants (described later). In the example shown in FIG. 1, the expected character string is "gita-".

通信端末１０等において、例えば、アプリケーションが実行され、発話評価が行われる場合、通信端末１０は、マイク等によって集音された話者Ｓの発話を、発話音声データとして発話評価サーバ３０に送信する。発話評価サーバ３０は、通信端末１０から取得した発話音声データを用いて、話者Ｓの発話について音声認識を行う。発話評価サーバ３０は、音声認識した結果を、発音を表す文字列に変換する。 For example, when an application is executed on a communication terminal 10 or the like to perform speech evaluation, the communication terminal 10 transmits the speech of speaker S, collected by a microphone or the like, to the speech evaluation server 30 as speech voice data. The speech evaluation server 30 performs voice recognition on the speech of speaker S using the speech voice data acquired from the communication terminal 10. The speech evaluation server 30 converts the result of the voice recognition into a character string representing the pronunciation.

In the example shown in FIG. 1, in response to the question 「あなたの趣味はなんですか？」 ("What is your hobby?"), speaker S utters 「趣味はビターを弾くことです。」 ("My hobby is playing the bitā"), mispronouncing ギター (gitā, "guitar") as ビター (bitā). This utterance is collected by the microphone of the communication terminal 10 and transmitted by the communication terminal 10 to the speech evaluation server 30 as speech voice data. The speech evaluation server 30 then performs speech recognition on the speech voice data and generates the sentence 「趣味はビターを弾くことです。」 as the recognition result. Finally, the speech evaluation server 30 converts the generated sentence into the romanized text "syumihabita-wohikukotodesu.", which represents its pronunciation.

発話評価サーバ３０は、想定文字列を取得する。発話評価サーバ３０は、音声認識した結果から変換された発音を表す文字列（以下、変換文字列と表記する。）から、想定文字列と同じ文字数の一又は複数の文字列（以下、発話文字列と表記する。）を生成する。発話評価サーバ３０は、想定文字列と各発話文字列との編集距離を、子音の発音分類を考慮して（後述する）算出する。ここで、編集距離とは、１文字の置換を１手順として、ある文字列を他の文字列に変形するのに必要な手順の最小回数を指す。発話評価サーバ３０は、算出した編集距離が所定値以下となる発話文字列の内、想定文字列と同一の文字列でないものを発音誤り文字列として検出する。 The speech evaluation server 30 acquires an expected string. The speech evaluation server 30 generates one or more strings (hereinafter referred to as spoken strings) with the same number of characters as the expected string from a string representing the pronunciation converted from the results of speech recognition (hereinafter referred to as a converted string). The speech evaluation server 30 calculates the edit distance between the expected string and each spoken string taking into account the pronunciation classification of the consonants (described below). Here, the edit distance refers to the minimum number of steps required to transform a string into another string, with the replacement of one character being one step. Among the spoken strings whose calculated edit distance is less than or equal to a predetermined value, the speech evaluation server 30 detects those that are not identical to the expected string as mispronounced strings.

図１に示される例では、発話評価サーバ３０によって、「ｓｙｕｍｉｈａｂｉｔａ－ｗｏｈｉｋｕｋｏｔｏｄｅｓｕ。」というローマ字文から、想定文字列である「ｇｉｔａ－」と同じ文字数である５文字単位のＮ－ｇｒａｍが生成され、それぞれ発話文字列として特定される。ここで、Ｎ－ｇｒａｍとは、変換文字列において、最初の文字を１文字ずつずらしながら重複を許して抜き出された所定の文字数の文字列である。そして、発話評価サーバ３０によって、想定文字列「ｇｉｔａ－」と各発話文字列との編集距離を文字数５で除算した値である誤り率が算出される。さらに、発話評価サーバ３０によって、想定文字列「ｇｉｔａ－」と比較された結果、誤り率が閾値を下回った発話文字列「ｂｉｔａ－」は、発音誤り文字列として検出される。 In the example shown in FIG. 1, the speech evaluation server 30 generates N-grams of five characters, the same number of characters as the expected character string "gita-", from the Roman alphabet sentence "syumihabita-wohikukotodesu.", and identifies each as a spoken character string. Here, an N-gram is a character string of a predetermined number of characters extracted from a converted character string while shifting the first character by one character and allowing overlaps. The speech evaluation server 30 then calculates an error rate, which is the value obtained by dividing the edit distance between the expected character string "gita-" and each spoken character string by the number of characters, 5. Furthermore, the speech evaluation server 30 detects the spoken character string "bita-", whose error rate falls below a threshold as a result of comparison with the expected character string "gita-", as a mispronounced character string.

When the speech evaluation server 30 detects a mispronounced character string, it corrects that string in the converted character string to the expected character string that was used to detect it. The speech evaluation server 30 converts the corrected converted character string (for example, romanized text) into a sentence in the language used in speaker S's speech (for example, Japanese); this converted sentence is hereinafter referred to as the evaluation sentence. The speech evaluation server 30 also converts the mispronounced character string into the language used in speaker S's speech; the resulting word is hereinafter referred to as the mispronounced word. The speech evaluation server 30 scores the evaluation sentence and outputs the evaluation sentence, the mispronounced word, and the scoring result. When no mispronounced character string is detected, the speech evaluation server 30 uses the original speech recognition result as the evaluation sentence.

図１に示される例では、発話評価サーバ３０によって、ローマ字文「ｓｙｕｍｉｈａｂｉｔａ－ｗｏｈｉｋｕｋｏｔｏｄｅｓｕ。」において、発音誤り文字列として検出された発話文字列「ｂｉｔａ－」を、想定文字列「ｇｉｔａ－」に訂正する。そして、発話評価サーバ３０によって、発音誤り訂正後のローマ字文「ｓｙｕｍｉｈａｇｉｔａ－ｗｏｈｉｋｕｋｏｔｏｄｅｓｕ。」は、発音誤り訂正後の日本語文「趣味はギターを弾くことです。」（評価文章）に変換される。さらに、評価文章「趣味はギターを弾くことです。」に単語の誤りがないことから、発話評価サーバ３０によって、発音誤り訂正後の日本語文は「１００点」であると採点される。 In the example shown in FIG. 1, the speech evaluation server 30 corrects the speech character string "bita-" detected as a mispronounced character string in the romanized sentence "syumihabita-wohikukotodesu." to the expected character string "gita-". Then, the speech evaluation server 30 converts the romanized sentence "syumihagita-wohikukotodesu." after the pronunciation error correction into the Japanese sentence "My hobby is playing the guitar." (evaluation sentence) after the pronunciation error correction. Furthermore, since there are no word errors in the evaluation sentence "My hobby is playing the guitar.", the speech evaluation server 30 scores the Japanese sentence after the pronunciation error correction as "100 points".
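The correction step in this example can be illustrated with a minimal Python sketch. The helper name `correct_at` and the idea of passing the start position of the detected N-gram are assumptions made for illustration; the description only states that the mispronounced character string is replaced by the expected character string.

```python
def correct_at(converted: str, start: int, expected: str) -> str:
    """Overwrite len(expected) characters of `converted`, beginning at
    `start`, with the expected character string (hypothetical helper)."""
    end = start + len(expected)
    return converted[:start] + expected + converted[end:]


converted = "syumihabita-wohikukotodesu."
start = converted.find("bita-")  # position of the detected N-gram
corrected = correct_at(converted, start, "gita-")
print(corrected)  # syumihagita-wohikukotodesu.
```

Replacing by position rather than by a global search-and-replace keeps any other, unrelated occurrence of the same five characters untouched; whether the actual system does this is not stated in the description.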

次に、図２を参照して、通信端末１０、及び、発話評価サーバ３０の機能的な構成要素について説明する。 Next, referring to FIG. 2, the functional components of the communication terminal 10 and the speech evaluation server 30 will be described.

The communication terminal 10 is, for example, a terminal configured to be able to communicate with a server, such as a smartphone, a tablet terminal, or a PC. When, for example, an application is executed, the communication terminal 10 collects the speech of speaker S with a built-in microphone or the like and transmits the collected audio to the speech evaluation server 30 as speech voice data. The communication terminal 10 also displays the evaluation sentence, the mispronounced word, and the scoring result received from the speech evaluation server 30 on the screen of a built-in display or the like. In the example shown in FIG. 1, the communication terminal 10 transmits the utterance 「趣味はビターを弾くことです。」 ("My hobby is playing the bitā") to the speech evaluation server 30 as speech voice data. The communication terminal 10 then acquires the evaluation sentence 「趣味はギターを弾くことです。」 ("My hobby is playing the guitar."), the mispronounced word 「ビター」 (bitā), and the scoring result "100 points" from the speech evaluation server 30 and displays them on the screen. The communication terminal 10 may also accept input of the pronunciation classification of consonants (described below), the native language of speaker S, and expected character strings, and transmit the input to the speech evaluation server 30.

The speech evaluation server 30 has the following functional components: a storage unit 31, a speech recognition unit 32, a conversion unit 33, a calculation unit 34, a detection unit 35, a correction unit 36, a scoring unit 37, and an output unit 38.

The storage unit 31 stores the pronunciation classification of consonants, the native language of speaker S, and the expected character strings. A pronunciation classification of consonants is a group of consonants that are estimated to be easily confused with one another in pronunciation. FIG. 3 shows an example of such classifications. For example, the consonants "r" and "l" are articulated in a similar manner, so one is estimated to be easily mispronounced as the other: a speaker trying to pronounce "r" may pronounce "l" by mistake, or vice versa. Therefore, "r" and "l" are assigned to the same pronunciation classification. The pronunciation classification of consonants may be set in advance or may be acquired from the communication terminal 10.

The pronunciation classification of consonants may be set according to the native language of the speaker. Specifically, because the groups of consonants estimated to be easily mispronounced differ from language to language, classifications are changed or added according to the speaker's native language. In the example shown in FIG. 3, native English speakers tend to mispronounce the consonant "t" as "s" or "r", so when the speaker's native language is English, the classifications "t, s" and "t, r" are added. When the speaker's native language is an Asian language, the classifications "z, j" and "d, r" are added, because native speakers of Asian languages tend to mispronounce the consonant "z" as "j" and the consonant "d" as "r".
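One possible way to represent language-dependent classifications is sketched below in Python. The table layout, the keys "english" and "asian", and the function names are hypothetical; only the consonant pairs themselves are taken from this description, and the actual table in FIG. 3 may contain more entries.

```python
# Pairs that this description places in the same classification
# regardless of native language.
COMMON_CLASSES = [frozenset({"r", "l"}), frozenset({"g", "b"})]

# Pairs added per native language; the dictionary keys are
# hypothetical labels chosen for this sketch.
EXTRA_CLASSES = {
    "english": [frozenset({"t", "s"}), frozenset({"t", "r"})],
    "asian": [frozenset({"z", "j"}), frozenset({"d", "r"})],
}


def classes_for(native_language: str) -> list:
    """Return the common classes plus any language-specific additions."""
    return COMMON_CLASSES + EXTRA_CLASSES.get(native_language, [])


def same_class(a: str, b: str, classes: list) -> bool:
    """True if two different consonants share at least one classification."""
    return a != b and any(a in c and b in c for c in classes)


print(same_class("t", "s", classes_for("english")))  # True
print(same_class("t", "s", classes_for("asian")))    # False
print(same_class("d", "r", classes_for("asian")))    # True
```

Storing each classification as a set makes the membership test symmetric, which matches the statement that "r" may be mispronounced as "l" or vice versa.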

The storage unit 31 stores the native languages of all speakers who may speak as the native language of speaker S. The storage unit 31 also stores, as expected character strings, the character strings expected as answers to predetermined questions. Specifically, when speech evaluation is performed, the native language of each person who may be speaker S (for example, English, Chinese, or German) is either set in advance or acquired from the communication terminal 10 or the like. Likewise, because the questions used in the speech evaluation are determined in advance, the character strings expected as answers are also either set in advance or acquired from the communication terminal 10 or the like. In the example shown in FIG. 1, the storage unit 31 stores the romanized string "gita-", which represents the pronunciation of ギター ("guitar"), as an expected character string for the answer to the predetermined question 「あなたの趣味はなんですか？」 ("What is your hobby?"). Other possible expected character strings in the example of FIG. 1 include "ryokou", representing the pronunciation of 旅行 ("travel"), and "yakyu", representing the pronunciation of 野球 ("baseball").

The speech recognition unit 32 acquires the speech voice data from the communication terminal 10, performs speech recognition on it, and passes the sentence obtained as the recognition result to the conversion unit 33 and the scoring unit 37. In the example shown in FIG. 1, the speech recognition unit 32 recognizes the speech voice data of the utterance 「趣味はビターを弾くことです。」 ("My hobby is playing the bitā") and generates the sentence 「趣味はビターを弾くことです。」.

The conversion unit 33 acquires the result of speech recognition of speaker S's speech and converts it into a character string representing the pronunciation. Specifically, the conversion unit 33 acquires the sentence that is the speech recognition result from the speech recognition unit 32 and converts it into romanized text representing its pronunciation. The conversion unit 33 passes this romanized text (the converted character string) to the calculation unit 34 and the correction unit 36. In the example shown in FIG. 1, the conversion unit 33 acquires the sentence 「趣味はビターを弾くことです。」 from the speech recognition unit 32 and converts it into the romanized text "syumihabita-wohikukotodesu.".

算出部３４は、話者Ｓの発話に出現すると想定される単語の発音を表す文字列である想定文字列と、変換部３３によって変換された発音を表す文字列に含まれる一又は複数の発話文字列との編集距離を算出する。具体的には、算出部３４は、想定文字列を記憶部３１から取得する。算出部３４は、想定文字列の文字数をＮとして、変換部３３によって変換された発音を表す文字列から一又は複数のＮ－ｇｒａｍを発話文字列として特定する。そして、算出部３４は、想定文字列と特定した発話文字列との編集距離を文字数Ｎで除算した値である誤り率を算出する。算出部３４は、全ての想定文字列について各発話文字列ごとに誤り率の算出を行い、全ての誤り率の値を検出部３５に引き渡す。なお、算出部３４が取得する想定文字列は、記憶部３１が記憶する全ての想定文字列でもよいし、質問や発話に関する情報に基づいて想定文字列の候補を絞り込める場合は、記憶部３１が記憶する一部の想定文字列でもよい。 The calculation unit 34 calculates the edit distance between an expected string, which is a string representing the pronunciation of a word expected to appear in the speech of the speaker S, and one or more utterance strings included in the string representing the pronunciation converted by the conversion unit 33. Specifically, the calculation unit 34 acquires the expected string from the storage unit 31. The calculation unit 34 identifies one or more N-grams as utterance strings from the string representing the pronunciation converted by the conversion unit 33, where the number of characters in the expected string is N. Then, the calculation unit 34 calculates an error rate, which is a value obtained by dividing the edit distance between the expected string and the identified utterance string by the number of characters N. The calculation unit 34 calculates the error rate for each utterance string for all expected strings, and transfers all the error rate values to the detection unit 35. Note that the expected strings acquired by the calculation unit 34 may be all expected strings stored in the storage unit 31, or may be some of the expected strings stored in the storage unit 31 if candidates for the expected string can be narrowed down based on information related to the question or utterance.

In the example shown in FIG. 1, the calculation unit 34 takes the 27-character romanized text "syumihabita-wohikukotodesu." acquired from the conversion unit 33 and extracts character strings with the same number of characters as the expected character string "gita-", shifting the starting character one position at a time and allowing overlaps. This yields 23 N-grams, "syumi", "yumih", "umiha", ..., "bita-", ..., "desu.", which are identified as spoken character strings. The calculation unit 34 calculates the edit distance between the expected character string and each spoken character string in consideration of the pronunciation classification of consonants, and takes the edit distance divided by the number of characters, 5, as the error rate. For example, the edit distance between the expected character string "gita-" and the spoken character string "syumi" is calculated as 5, which divided by the number of characters, 5, gives an error rate of 1 for "syumi". Similarly, the error rate of the spoken character string "bita-" is calculated as 0.1, taking into account the pronunciation classification of consonants (described below).
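The N-gram extraction described here can be reproduced with a short Python sketch. The function name `char_ngrams` is hypothetical; the input strings are the ones used in the FIG. 1 example.

```python
def char_ngrams(text: str, n: int) -> list:
    """Return every length-n substring of `text`, shifting the start
    position one character at a time (overlaps allowed)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


converted = "syumihabita-wohikukotodesu."
expected = "gita-"

spoken = char_ngrams(converted, len(expected))
print(len(converted), len(spoken))  # 27 23
print(spoken[:3])                   # ['syumi', 'yumih', 'umiha']
print(spoken[7], spoken[-1])        # bita- desu.
```

A text of length L yields L - N + 1 windows of length N, which is where the figure of 23 spoken character strings for a 27-character text and N = 5 comes from.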

また、算出部３４は、想定文字列及び発話文字列に含まれる子音の発音分類を考慮して、編集距離を算出する。具体的には、算出部３４は、編集距離の算出において、同じ発音分類に含まれる子音同士が置換される場合、１文字の置換を０．５手順として編集距離を算出する。ここで、１手順とは、通常、ある文字列を他の文字列に変形する場合に、１文字の置換が１回行われたことを意味する。図４に示される例では、算出部３４が、想定文字列「ｇｉｔａ－」と発話文字列「ｍｉｔａ－」との編集距離、及び、想定文字列「ｇｉｔａ－」と発話文字列「ｂｉｔａ－」との編集距離をそれぞれ算出している。想定文字列「ｇｉｔａ－」を発話文字列「ｍｉｔａ－」に変形するには、１文字目のみを子音「ｇ」から子音「ｍ」に置換する必要がある。よって、算出部３４は、想定文字列「ｇｉｔａ－」と、発話文字列「ｍｉｔａ－」との編集距離を「１」と算出する。一方で、想定文字列「ｇｉｔａ－」を発話文字列「ｂｉｔａ－」に変形するには、１文字目のみを「ｇ」から「ｂ」に置換する必要がある。ここで、図３に示されるように、子音「ｇ」及び「ｂ」は、子音の発音分類において同じ分類であるため、算出部３４は、想定文字列「ｇｉｔａ－」と発話文字列「ｂｉｔａ－」との編集距離を「０．５」と算出する。なお、編集距離の算出において、同じ発音分類に含まれる子音同士の置換が行われる場合、１文字置換される際の手順数は、他の場合より小さければよい。例えば、同じ発音分類に含まれる子音同士の置換が行われる場合、１文字の置換を、０．３手順などとして編集距離を算出してもよい。 The calculation unit 34 also calculates the edit distance taking into consideration the pronunciation classification of the consonants included in the expected character string and the spoken character string. Specifically, when consonants included in the same pronunciation classification are replaced in the calculation of the edit distance, the calculation unit 34 calculates the edit distance with 0.5 steps per character replacement. Here, one step usually means that when a character string is transformed into another character string, one character replacement is performed once. In the example shown in FIG. 4, the calculation unit 34 calculates the edit distance between the expected character string "gita-" and the spoken character string "mita-", and the edit distance between the expected character string "gita-" and the spoken character string "bita-". To transform the expected character string "gita-" into the spoken character string "mita-", it is necessary to replace only the first character from the consonant "g" to the consonant "m". Therefore, the calculation unit 34 calculates the edit distance between the expected character string "gita-" and the spoken character string "mita-" to be "1". 
On the other hand, to transform the expected character string "gita-" into the spoken character string "bita-", only the first character needs to be replaced, from "g" to "b". Here, as shown in FIG. 3, the consonants "g" and "b" belong to the same category in the consonant pronunciation classification, so the calculation unit 34 calculates the edit distance between the expected character string "gita-" and the spoken character string "bita-" as "0.5". Note that, when consonants in the same pronunciation classification are replaced with each other, the number of steps counted for a one-character replacement need only be smaller than in other cases. For example, a replacement between consonants in the same pronunciation classification may be counted as 0.3 steps when calculating the edit distance.
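The class-aware edit distance can be sketched as a standard dynamic-programming Levenshtein distance with a reduced substitution cost. This is a sketch under stated assumptions: the contents of FIG. 3 are not reproduced in this text, so `CONSONANT_CLASSES` contains only the one grouping the description confirms ("g" and "b"), and all function names are hypothetical.

```python
# Hypothetical classification: the description only confirms that "g" and "b" share a class.
CONSONANT_CLASSES = [{"g", "b"}]

def substitution_cost(a: str, b: str, same_class_cost: float = 0.5) -> float:
    """0 for identical characters, a reduced cost within one class, otherwise 1."""
    if a == b:
        return 0.0
    if any(a in group and b in group for group in CONSONANT_CLASSES):
        return same_class_cost
    return 1.0

def weighted_edit_distance(expected: str, spoken: str, same_class_cost: float = 0.5) -> float:
    """Levenshtein distance in which same-class consonant substitutions cost less than 1."""
    rows, cols = len(expected) + 1, len(spoken) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = float(i)  # delete i characters
    for j in range(cols):
        dist[0][j] = float(j)  # insert j characters
    for i in range(1, rows):
        for j in range(1, cols):
            sub = substitution_cost(expected[i - 1], spoken[j - 1], same_class_cost)
            dist[i][j] = min(dist[i - 1][j] + 1.0,      # deletion
                             dist[i][j - 1] + 1.0,      # insertion
                             dist[i - 1][j - 1] + sub)  # substitution or match
    return dist[-1][-1]

print(weighted_edit_distance("gita-", "mita-"))      # 1.0
print(weighted_edit_distance("gita-", "bita-"))      # 0.5
print(weighted_edit_distance("gita-", "bita-") / 5)  # 0.1 (error rate)
```

Passing `same_class_cost=0.3` gives the alternative weighting mentioned in the text; any value below 1 preserves the ordering between same-class and other substitutions.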

検出部３５は、一又は複数の発話文字列のうち、編集距離が所定値以下であり、且つ、想定文字列と同一の文字列ではない発話文字列を、発音誤り文字列として検出する。具体的には、検出部３５は、特定の想定文字列との誤り率が算出された発話文字列の中で、誤り率が最も小さく、且つ、誤り率が０よりも大きい発話文字列について、該誤り率が、発音誤りと判定する閾値以下であったとき、該発話文字列を発音誤り文字列として検出する。検出部３５は、検出した発音誤り文字列を訂正部３６に引き渡す。また、検出部３５は、発音誤り文字列を、発音誤り単語に変換し、出力部３８に引き渡す。図１に示される例では、想定文字列「ｇｉｔａ－」との誤り率が算出された複数の発話文字列の中で、誤り率が最も小さい発話文字列「ｂｉｔａ－」の誤り率「０．１」が、発音誤りと判定する閾値「０．３」以下であったことから、検出部３５は、発話文字列「ｂｉｔａ－」を発音誤り文字列として検出する。 The detection unit 35 detects, as a mispronounced string, a spoken string that has an edit distance equal to or less than a predetermined value and is not identical to an expected string among one or more spoken strings. Specifically, the detection unit 35 detects, as a mispronounced string, a spoken string that has the smallest error rate among the spoken strings whose error rate with a specific expected string has been calculated and that has an error rate greater than 0, if the error rate is equal to or less than a threshold for determining that the string is a mispronounced string. The detection unit 35 passes the detected mispronounced string to the correction unit 36. The detection unit 35 also converts the mispronounced string into a mispronounced word and passes it to the output unit 38. In the example shown in FIG. 1, among multiple spoken strings whose error rates with the expected string "gita-" have been calculated, the spoken string "bita-" has the smallest error rate, and its error rate of "0.1" is below the threshold value of "0.3" for determining a pronunciation error, so the detection unit 35 detects the spoken string "bita-" as a mispronounced string.
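The detection rule can be sketched as follows. The sketch assumes one particular reading of the rule: the spoken string with the smallest error rate is selected first, and it is reported only if that rate is greater than 0 and at or below the threshold. Under the other possible reading (the smallest among the nonzero rates), the zero-rate entries would be filtered out before taking the minimum. The function name and the dictionary input are hypothetical.

```python
from typing import Optional

def detect_mispronounced(error_rates: dict[str, float], threshold: float = 0.3) -> Optional[str]:
    """Return the spoken string judged to be a mispronunciation, or None.

    Assumed reading: take the spoken string with the smallest error rate and
    report it only when 0 < rate <= threshold.
    """
    if not error_rates:
        return None
    best = min(error_rates, key=error_rates.get)
    rate = error_rates[best]
    if 0 < rate <= threshold:
        return best
    return None

# Error rates quoted in the FIG. 1 example (other N-grams omitted for brevity).
print(detect_mispronounced({"syumi": 1.0, "bita-": 0.1}))  # bita-
print(detect_mispronounced({"syumi": 1.0, "gita-": 0.0}))  # None (expected word said correctly)
print(detect_mispronounced({"syumi": 1.0, "mota-": 0.4}))  # None (above the threshold)
```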

訂正部３６は、変換された発音を表す文字列において、検出部にて検出された発音誤り文字列を、該発音誤り文字列の検出に用いられた編集距離の算出に用いられた想定文字列に訂正し、変換された発音を表す文字列を、発話において用いられた言語の文章に変換する。具体的には、訂正部３６は、変換部３３から変換文字列を取得する。訂正部３６は、取得した変換文字列において、発音誤り文字列を、該発音誤り文字列の検出に用いられた誤り率の算出に用いられた想定文字列に訂正する。訂正部３６は、訂正後の変換文字列を、話者Ｓの発話において用いられた言語の文章（評価文章）に変換する。訂正部３６は、評価文章を、採点部３７及び出力部３８に引き渡す。 The correction unit 36 corrects the mispronounced character string detected by the detection unit in the character string representing the converted pronunciation to the assumed character string used in calculating the edit distance used to detect the mispronounced character string, and converts the character string representing the converted pronunciation to a sentence in the language used in the utterance. Specifically, the correction unit 36 acquires the converted character string from the conversion unit 33. The correction unit 36 corrects the mispronounced character string in the acquired converted character string to the assumed character string used in calculating the error rate used to detect the mispronounced character string. The correction unit 36 converts the corrected converted character string into a sentence (evaluation sentence) in the language used in the utterance of the speaker S. The correction unit 36 passes the evaluation sentence to the scoring unit 37 and the output unit 38.

図１に示される例では、訂正部３６は、変換文字列「ｓｙｕｍｉｈａｂｉｔａ－ｗｏｈｉｋｕｋｏｔｏｄｅｓｕ。」において、発音誤り文字列「ｂｉｔａ－」を、該発音誤り文字列の検出に用いられた編集距離の算出に用いられた想定文字列「ｇｉｔａ－」に訂正する。訂正部３６は、訂正後の変換文字列「ｓｙｕｍｉｈａｇｉｔａ－ｗｏｈｉｋｕｋｏｔｏｄｅｓｕ。」を、話者Ｓの発話において用いられた言語である日本語の文章「趣味はギターを弾くことです。」（評価文章）に変換する。 In the example shown in FIG. 1, the correction unit 36 corrects the mispronounced character string "bita-" in the converted character string "syumihabita-wohikukotodesu." to the assumed character string "gita-" that was used in calculating the edit distance used to detect the mispronounced character string. The correction unit 36 converts the corrected converted character string "syumihagita-wohikukotodesu." into the Japanese sentence "My hobby is playing the guitar" (evaluation sentence), which is the language used in the utterance of the speaker S.
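The string-level part of this correction can be sketched as below. The sketch assumes that replacing the first occurrence of the mispronounced string is sufficient, which holds for the FIG. 1 example but is not stated in the description; the conversion of the corrected romanized string back into a Japanese sentence is outside the scope of the sketch. The function name is hypothetical.

```python
def correct_converted_string(converted: str, mispronounced: str, expected: str) -> str:
    """Replace the first occurrence of the mispronounced string with the expected string."""
    return converted.replace(mispronounced, expected, 1)

corrected = correct_converted_string("syumihabita-wohikukotodesu.", "bita-", "gita-")
print(corrected)  # syumihagita-wohikukotodesu.
```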

採点部３７は、検出部３５によって発音誤り文字列が検出された場合、訂正部３６において変換された文章を採点し、検出部３５によって発音誤り文字列が検出されない場合、音声認識した結果を採点する。具体的には、採点部３７は、検出部３５が発音誤り文字列を検出した場合、訂正部３６から評価文章を取得する。また、採点部３７は、検出部３５が発音誤り文字列を検出しない場合、音声認識部３２から音声認識の結果である文章を取得し、評価文章とする。そして、両方の場合において、採点部３７は、評価文章を採点する。採点部３７は、採点結果及び評価文章を出力部３８に引き渡す。 When the detection unit 35 detects a mispronounced character string, the scoring unit 37 scores the sentence converted by the correction unit 36, and when the detection unit 35 does not detect a mispronounced character string, the scoring unit 37 scores the result of the speech recognition. Specifically, when the detection unit 35 detects a mispronounced character string, the scoring unit 37 obtains the evaluation sentence from the correction unit 36. When the detection unit 35 does not detect a mispronounced character string, the scoring unit 37 obtains the sentence that is the result of the speech recognition from the speech recognition unit 32 and sets it as the evaluation sentence. In both cases, the scoring unit 37 scores the evaluation sentence. The scoring unit 37 passes the scoring result and the evaluation sentence to the output unit 38.

図１に示される例では、検出部３５が発音誤り文字列を検出した場合が示されており、採点部３７は、訂正部３６から訂正後の文章「趣味はギターを弾くことです。」を取得する。訂正後の文章「趣味はギターを弾くことです。」に単語等の間違いがないため、採点部３７は、「１００点」という採点結果を得る。 In the example shown in FIG. 1, the detection unit 35 detects a mispronounced character string, and the scoring unit 37 obtains the corrected sentence "My hobby is playing the guitar." from the correction unit 36. Since the corrected sentence "My hobby is playing the guitar." does not contain any errors in words, etc., the scoring unit 37 obtains a scoring result of "100 points."

出力部３８は、検出部３５から、発音誤り単語を取得する。出力部３８は、採点部３７から、採点結果及び評価文章を取得する。出力部３８は、評価文章、発音誤り単語及び採点結果を、通信端末１０などの外部機器に出力する。図１に示される例では、出力部３８は、発音誤り単語「ビター」、採点結果「１００点」及び評価文章「趣味はギターを弾くことです。」を出力する。 The output unit 38 acquires the mispronounced word from the detection unit 35. The output unit 38 acquires the scoring result and the evaluation sentence from the scoring unit 37. The output unit 38 outputs the evaluation sentence, the mispronounced word, and the scoring result to an external device such as the communication terminal 10. In the example shown in FIG. 1, the output unit 38 outputs the mispronounced word "bitter", the scoring result "100 points", and the evaluation sentence "My hobby is playing the guitar."

次に、本実施形態に係る発話評価システム１が行う処理、具体的には、通信端末１０が取得した話者Ｓの発話における発音誤りを訂正し、発話の採点を行う処理について、図５を参照して説明する。図５は、発話評価システム１が行う処理を示すフローチャートである。 Next, the process performed by the speech evaluation system 1 according to this embodiment, specifically, the process of correcting pronunciation errors in the speech of the speaker S acquired by the communication terminal 10 and scoring the speech, will be described with reference to FIG. 5. FIG. 5 is a flowchart showing the process performed by the speech evaluation system 1.

図５に示されるように、発話評価システム１では、音声認識部３２によって、発話音声データが取得される。そして、音声認識部３２によって、発話音声データは音声認識され、音声認識結果である文章が生成される（ステップＳ１０１）。続いて、変換部３３によって、音声認識結果が取得され、発音を表す文字列に変換される（ステップＳ１０２）。具体的には、変換部３３によって、音声認識の結果である文章が取得され、発音を表すローマ字文（変換文字列）に変換される。 As shown in FIG. 5, in the speech evaluation system 1, the speech recognition unit 32 acquires speech data. The speech recognition unit 32 then performs speech recognition on the speech data, and generates a sentence that is the speech recognition result (step S101). The conversion unit 33 then acquires the speech recognition result and converts it into a character string that represents the pronunciation (step S102). Specifically, the conversion unit 33 acquires the sentence that is the speech recognition result and converts it into a romanized character string that represents the pronunciation.

続いて、算出部３４によって、発音を表す文字列から発話文字列が生成される（ステップＳ１０３）。具体的には、算出部３４によって、想定文字列が記憶部３１から取得される。そして、算出部３４によって、想定文字列の文字数をＮとして、一又は複数のＮ－ｇｒａｍが、変換文字列から、発話文字列として特定される。続いて、算出部３４によって、発話文字列と想定文字列との編集距離が導出される（ステップＳ１０４）。具体的には、算出部３４によって、想定文字列及び発話文字列に含まれる子音の発音分類を考慮して、想定文字列と各発話文字列との編集距離が、想定文字列の文字数Ｎで除算され、複数の誤り率が算出される。 Then, the calculation unit 34 generates a spoken string from the string representing the pronunciation (step S103). Specifically, the calculation unit 34 acquires an expected string from the storage unit 31. Then, the calculation unit 34 identifies one or more N-grams as a spoken string from the converted string, where N is the number of characters in the expected string. Next, the calculation unit 34 derives an edit distance between the spoken string and the expected string (step S104). Specifically, the calculation unit 34 divides the edit distance between the expected string and each spoken string by the number of characters N in the expected string, taking into account the pronunciation classification of the consonants included in the expected string and the spoken string, to calculate multiple error rates.

続いて、検出部３５によって、複数の発話文字列において、発音誤り文字列が存在するか否かが判定される（ステップＳ１０５）。具体的には、検出部３５によって、特定の想定文字列との誤り率が算出された発話文字列の中で、誤り率が最も小さく、且つ、誤り率が０よりも大きい発話文字列について、誤り率が、発音誤りと判定する閾値以下であったとき、該発話文字列が発音誤り文字列として検出される。 Then, the detection unit 35 determines whether or not a mispronounced character string exists among the multiple spoken character strings (step S105). Specifically, among the spoken character strings whose error rates with respect to a specific expected character string have been calculated, the detection unit 35 takes the spoken character string that has the smallest error rate and whose error rate is greater than 0, and detects it as a mispronounced character string when its error rate is equal to or less than the threshold for determining a pronunciation error.

検出部３５によって、発音誤り文字列が検出された場合（ステップＳ１０５：ＹＥＳ）、訂正部３６によって、変換文字列において、発音誤り文字列が、該発音誤り文字列の検出に用いられた誤り率の算出に用いられた想定文字列に訂正される（ステップＳ１０６）。続いて、訂正部３６によって、訂正後の変換文字列が、話者Ｓの発話において用いられた言語の文章（評価文章）に変換される（ステップＳ１０７）。一方、検出部３５によって、発音誤り文字列が検出されない場合（ステップＳ１０５：Ｎｏ）、採点部３７によって、音声認識部３２から音声認識の結果である文章が取得され、評価文章とされる（ステップＳ１０８）。 If the detection unit 35 detects a mispronounced character string (step S105: YES), the correction unit 36 corrects the mispronounced character string in the converted character string to the assumed character string used to calculate the error rate used to detect the mispronounced character string (step S106). The correction unit 36 then converts the corrected converted character string into a sentence in the language used in the utterance of the speaker S (evaluation sentence) (step S107). On the other hand, if the detection unit 35 does not detect a mispronounced character string (step S105: No), the scoring unit 37 obtains a sentence that is the result of the speech recognition from the speech recognition unit 32 and sets it as the evaluation sentence (step S108).

続いて、採点部３７によって、評価文章が採点される（ステップＳ１０９）。続いて、出力部３８によって、評価文章、発音誤り単語及び採点結果が出力される（ステップＳ１１０）。 Then, the scoring unit 37 scores the evaluation sentence (step S109). Then, the output unit 38 outputs the evaluation sentence, the mispronounced words, and the scoring results (step S110).
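The control flow of steps S102 to S110 can be sketched as a single function that receives each processing stage as a callable. This is only an outline of the branching in FIG. 5: every stage passed in below is a toy stand-in, the recognized sentence, the lookup tables and the scoring rule are hypothetical, and the sketch returns the mispronounced string in romanized form rather than converting it to a word.

```python
from typing import Callable, Optional, Tuple

def evaluate_utterance(
    recognized: str,
    to_pronunciation: Callable[[str], str],
    detect: Callable[[str], Optional[Tuple[str, str]]],
    to_sentence: Callable[[str], str],
    score: Callable[[str], int],
) -> dict:
    converted = to_pronunciation(recognized)                       # S102
    hit = detect(converted)                                        # S103 to S105
    if hit is not None:                                            # S105: YES
        mispronounced, expected = hit
        corrected = converted.replace(mispronounced, expected, 1)  # S106
        evaluation = to_sentence(corrected)                        # S107
    else:                                                          # S105: No
        mispronounced = None
        evaluation = recognized                                    # S108
    return {                                                       # S109, S110
        "evaluation_sentence": evaluation,
        "mispronounced": mispronounced,
        "score": score(evaluation),
    }

# Toy stand-ins for the real units (all hypothetical).
ROMAJI = {"趣味はビターを弾くことです。": "syumihabita-wohikukotodesu."}
SENTENCES = {"syumihagita-wohikukotodesu.": "趣味はギターを弾くことです。"}

result = evaluate_utterance(
    "趣味はビターを弾くことです。",
    to_pronunciation=ROMAJI.__getitem__,
    detect=lambda s: ("bita-", "gita-") if "bita-" in s else None,
    to_sentence=SENTENCES.__getitem__,
    score=lambda sentence: 100 if sentence == "趣味はギターを弾くことです。" else 0,
)
print(result)
```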

次に、本実施形態に係る発話評価システムの作用効果について説明する。 Next, we will explain the effects of the speech evaluation system according to this embodiment.

本実施形態に係る発話評価システム１は、話者Ｓの発話を音声認識した結果を取得し、該音声認識した結果を、発音を表す文字列に変換する変換部３３と、話者の発話に出現すると想定される単語の発音を表す文字列である想定文字列と、変換部によって変換された発音を表す文字列に含まれる一又は複数の発話文字列との編集距離を算出し、該編集距離から誤り率を算出する算出部３４と、一又は複数の発話文字列のうち、誤り率が所定値以下であり、且つ、想定文字列と同一の文字列ではない発話文字列を、発音誤り文字列として検出する検出部３５と、発音誤り文字列を出力する出力部３８と、を備え、算出部は、想定文字列及び発話文字列に含まれる子音の発音分類を考慮して、編集距離を算出する。 The speech evaluation system 1 according to this embodiment includes: a conversion unit 33 that acquires the result of speech recognition of the speech of a speaker S and converts that result into a character string representing the pronunciation; a calculation unit 34 that calculates an edit distance between an expected character string, which is a character string representing the pronunciation of a word expected to appear in the speaker's speech, and one or more spoken character strings included in the character string representing the pronunciation converted by the conversion unit, and calculates an error rate from the edit distance; a detection unit 35 that detects, as a mispronounced character string, a spoken character string among the one or more spoken character strings whose error rate is equal to or less than a predetermined value and which is not identical to the expected character string; and an output unit 38 that outputs the mispronounced character string. The calculation unit calculates the edit distance taking into account the pronunciation classification of the consonants included in the expected character string and the spoken character string.

本実施形態に係る発話評価システム１では、発音を表す文字列に話者Ｓの発話が変換され、話者Ｓの発話に出現すると想定される単語の発音を表す文字列である想定文字列と変換された発音を表す文字列に含まれる一又は複数の発話文字列との編集距離が算出され、該編集距離から誤り率が算出され、一又は複数の発話文字列のうち、誤り率が所定値以下であり、且つ、想定文字列と同一の文字列ではない発話文字列が、発音誤り文字列として検出される。ここで、発話における誤り検出において、例えば、発話を音声認識した結果が文章に変換され、発話における誤りが検出される場合には、通常、話者による単語の誤用と、話者による発音誤りとが区別されず、共に発話における誤りとして検出されてしまう。このような検出処理では、話者の発話における発音誤りのみを適切に検出することができない。この点、本実施形態に係る発話評価システム１では、話者Ｓの発話が、発音を表す文字列に変換され、該文字列に含まれる発話文字列と、想定文字列との編集距離が算出され、該編集距離から誤り率が算出され、該誤り率が所定値以下である発話文字列が発話誤り文字列として検出されるため、「誤り度合いが小さく、単語の誤用というよりも単なる発音の誤りである可能性が高い」と推定される発話文字列について、適切に発音誤り文字列として検出することができる。さらに、本実施形態に係る発話評価システム１では、子音の発音分類が考慮されて、想定文字列と発話文字列との編集距離が算出される。このような構成によれば、例えば、想定文字列と発音が類似する発音文字列ほど想定文字列との編集距離が小さくなるように、編集距離が算出されるため、想定文字列と発音が類似する発音文字列ほど想定文字列との誤り率が小さくなる。これにより、検出部は、想定文字列に発音が近い発話文字列（「単語の誤用というよりも単なる発音の誤りである可能性が高い」文字列）を発音誤り文字列として検出することができる。以上のように、本実施形態に係る発話評価システム１によれば、話者の発話における発音の誤りを適切に検出することができる。 In the speech evaluation system 1 according to the present embodiment, the speech of the speaker S is converted into a character string representing a pronunciation, an edit distance between an expected character string, which is a character string representing the pronunciation of a word expected to appear in the speech of the speaker S, and one or more speech character strings included in the converted character string representing the pronunciation is calculated, an error rate is calculated from the edit distance, and among the one or more speech character strings, a speech character string whose error rate is equal to or less than a predetermined value and is not identical to the expected character string is detected as a mispronunciation character string. Here, in detecting an error in speech, for example, when the result of speech recognition is converted into a sentence and an error in speech is detected, a misuse of a word by the speaker and an error in pronunciation by the speaker are usually not distinguished from each other, and both are detected as errors in speech. 
Such a detection process cannot appropriately isolate the pronunciation errors in the speaker's speech. In the speech evaluation system 1 according to the present embodiment, by contrast, the speech of the speaker S is converted into a character string representing the pronunciation, the edit distance between each spoken character string included in that character string and the expected character string is calculated, an error rate is calculated from the edit distance, and a spoken character string whose error rate is equal to or less than a predetermined value is detected as a mispronounced character string. A spoken character string estimated to have "a small degree of error and a high probability of being a mere pronunciation error rather than a misuse of a word" can therefore be appropriately detected as a mispronounced character string. Furthermore, in the speech evaluation system 1 according to the present embodiment, the edit distance between the expected character string and the spoken character string is calculated taking into consideration the pronunciation classification of the consonants. With this configuration, the edit distance is calculated so that, for example, the closer a spoken character string is in pronunciation to the expected character string, the smaller its edit distance from the expected character string, and consequently the smaller its error rate. As a result, the detection unit can detect a spoken character string whose pronunciation is close to the expected character string (a string that is "highly likely to be a mere pronunciation error rather than a misuse of a word") as a mispronounced character string.
As described above, the speech evaluation system 1 according to this embodiment can appropriately detect pronunciation errors in a speaker's speech.

図６に示される例では、「ギター」という想定単語と、「モター」、「ミター」及び「ビター」という発話単語との編集距離の算出について、従来技術、及び、本実施形態に係る技術が表されている。想定単語は、発話に出現すると想定される単語であり、発話単語は、話者Ｓによって発話された単語である。従来技術が用いられる場合、想定単語と各発話単語との編集距離は、全て「１」となる。ここで、想定単語が想定文字列に、発話単語が発話文字列に変換され、各発話文字列について、想定文字列「ｇｉｔａ－」との編集距離が算出される場合、該編集距離は、発話文字列「ｍｏｔａ－」では「２」、発話文字列「ｍｉｔａ－」及び「ｂｉｔａ－」では「１」となる。このように、各単語の発音が考慮されることにより、想定文字列に発音が類似する発音文字列ほど想定文字列との編集距離が小さくなる。さらに、本実施形態に係る技術を用いて、子音の発音分類が考慮されつつ、各発話文字列について、想定文字列「ｇｉｔａ－」との編集距離が算出される場合、該編集距離は、発話文字列「ｍｏｔａ－」では「２」、発話文字列「ｍｉｔａ－」では「１」、及び発話文字列「ｂｉｔａ－」では「０．５」となる。このように、各単語の発音における子音の発音類似性が考慮されることにより、想定文字列と子音の発音がより類似する発音文字列ほど想定文字列との編集距離がさらに小さくなる。以上のように、本実施形態に係る技術を用いると、想定文字列と発音が類似する発音文字列ほど想定文字列との編集距離が小さくなるように、編集距離が算出される。これにより、想定文字列と発音が類似する発音文字列ほど想定文字列との誤り率が小さくなるため、話者の発話における発音の誤りを適切に検出することができる。 The example shown in FIG. 6 compares the conventional technology and the technology according to the present embodiment in calculating the edit distance between the expected word "guitar" (ギター) and the spoken words "motah" (モター), "mitah" (ミター), and "bitter" (ビター). The expected word is a word that is expected to appear in an utterance, and a spoken word is a word that was spoken by the speaker S. When the conventional technology is used, the edit distance between the expected word and each spoken word is "1" in every case. When the expected word is instead converted into an expected character string and each spoken word into a spoken character string, and the edit distance between each spoken character string and the expected character string "gita-" is calculated, the edit distance is "2" for the spoken character string "mota-" and "1" for the spoken character strings "mita-" and "bita-". Taking the pronunciation of each word into account in this way means that the closer a spoken character string is in pronunciation to the expected character string, the smaller its edit distance from the expected character string.
Furthermore, when the technology according to the present embodiment is used to calculate the edit distance between each spoken character string and the expected character string "gita-" while also taking into account the pronunciation classification of the consonants, the edit distance is "2" for the spoken character string "mota-", "1" for the spoken character string "mita-", and "0.5" for the spoken character string "bita-". Taking the pronunciation similarity of the consonants into account in this way means that a spoken character string whose consonants are closer in pronunciation to those of the expected character string has an even smaller edit distance from the expected character string. As described above, with the technology according to the present embodiment, the edit distance is calculated so that the closer a spoken character string is in pronunciation to the expected character string, the smaller its edit distance. The error rate is correspondingly smaller for spoken character strings that are closer in pronunciation to the expected character string, so pronunciation errors in the speaker's speech can be appropriately detected.
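The three sets of values described for FIG. 6 can be reproduced with one edit-distance routine applied at three levels: katakana words, romanized strings, and romanized strings with a reduced same-class substitution cost. This is a sketch for checking the arithmetic only; the function names are hypothetical, and the class test knows only the "g"/"b" pairing that the description confirms.

```python
from typing import Callable

def edit_distance(a: str, b: str,
                  same_class: Callable[[str, str], bool] = lambda x, y: False,
                  same_class_cost: float = 0.5) -> float:
    """Levenshtein distance; substitutions within one class cost same_class_cost."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [float(i)]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                sub = 0.0
            elif same_class(ca, cb):
                sub = same_class_cost
            else:
                sub = 1.0
            cur.append(min(prev[j] + 1.0, cur[j - 1] + 1.0, prev[j - 1] + sub))
        prev = cur
    return prev[-1]

def g_b_class(x: str, y: str) -> bool:
    # Hypothetical stand-in for FIG. 3: only "g" and "b" are known to share a class.
    return {x, y} == {"g", "b"}

words = [("モター", "mota-"), ("ミター", "mita-"), ("ビター", "bita-")]
for kana, romaji in words:
    print(kana,
          edit_distance("ギター", kana),              # word level
          edit_distance("gita-", romaji),             # pronunciation level
          edit_distance("gita-", romaji, g_b_class))  # with consonant classes
# モター 1.0 2.0 2.0
# ミター 1.0 1.0 1.0
# ビター 1.0 1.0 0.5
```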

算出部３４は、話者Ｓの母語に応じて設定される子音の発音分類を考慮して、編集距離を算出し、該編集距離から誤り率を算出する。ここで、子音の発音類似性は、話者Ｓの母語によって異なる。ゆえに、このような構成によれば、例えば、話者Ｓの母語に応じた子音の発音類似性が考慮されて、想定文字列と発音が類似する発音文字列ほど想定文字列との誤り率が小さくなるように、誤り率が算出される。例えば、話者Ｓの母語が英語であるとき、英語を母語とする話者Ｓにとって発音が類似する子音「ｒ」及び子音「ｔ」が、追加の発音分類として、子音の発音分類に付け加えられ、編集距離及び誤り率が算出される。これにより、話者の母語を考慮して、想定文字列に発音が近い発話文字列（「単語の誤用というよりも単なる発音の誤りである可能性が高い」文字列）を発音誤り文字列として検出することができる。 The calculation unit 34 calculates the edit distance in consideration of the pronunciation classification of the consonants set according to the native language of the speaker S, and calculates the error rate from the edit distance. Here, the pronunciation similarity of the consonants differs depending on the native language of the speaker S. Therefore, according to this configuration, for example, the pronunciation similarity of the consonants according to the native language of the speaker S is taken into consideration, and the error rate is calculated so that the error rate with the expected string becomes smaller as the pronunciation string becomes more similar to the expected string. For example, when the native language of the speaker S is English, the consonants "r" and "t", which are similar in pronunciation to the speaker S whose native language is English, are added to the pronunciation classification of the consonants as additional pronunciation classifications, and the edit distance and error rate are calculated. In this way, it is possible to detect a spoken string whose pronunciation is close to the expected string (a string that is "more likely to be a mere pronunciation error rather than a misuse of a word") as a mispronounced string in consideration of the native language of the speaker.
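One way to picture a native-language-dependent classification is a base set of consonant classes extended by per-language additions. The sketch below is an assumption about how such a table might be organized: the language codes, the table names, and the base class are hypothetical, and only the "r"/"t" pairing for English speakers and the "g"/"b" pairing come from the description.

```python
BASE_CLASSES = [{"g", "b"}]            # hypothetical base classification
EXTRA_CLASSES = {"en": [{"r", "t"}]}   # additional class for native English speakers

def classes_for(native_language: str) -> list[set[str]]:
    """Consonant classes that apply to a speaker with the given native language."""
    return BASE_CLASSES + EXTRA_CLASSES.get(native_language, [])

def substitution_cost(a: str, b: str, native_language: str,
                      same_class_cost: float = 0.5) -> float:
    """Cost of substituting a with b, reduced when both fall in one class."""
    if a == b:
        return 0.0
    if any(a in group and b in group for group in classes_for(native_language)):
        return same_class_cost
    return 1.0

print(substitution_cost("r", "t", "en"))  # 0.5
print(substitution_cost("r", "t", "ja"))  # 1.0
print(substitution_cost("g", "b", "ja"))  # 0.5
```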

算出部３４は、想定文字列の文字数をＮとして、変換部によって変換された発音を表す文字列から一又は複数のＮ－ｇｒａｍを発話文字列として特定する。ここで、発音の誤りの検出に関して、例えば、変換された発音を表す文字列が、所定の文字数の複数の文字列に切り分けられて、発話文字列とされた場合には、発音の誤りが複数の文字列に跨っているような場合に、発音の誤りが適切に検出されないことがある。この点、本実施形態に係る構成では、変換された発音を表す文字列から一又は複数のＮ－ｇｒａｍが発話文字列として特定される。これにより、Ｎ－ｇｒａｍは、変換された発音を表す文字列において、最初の文字を１文字ずつずらしながら重複を許して抜き出された所定の文字数の文字列であるため、話者の発話における発音誤りを漏れなく検出することができる。 The calculation unit 34 identifies one or more N-grams as spoken strings from the string representing the pronunciation converted by the conversion unit, where the number of characters in the expected string is N. Regarding detection of pronunciation errors, for example, if the string representing the converted pronunciation is divided into multiple strings of a predetermined number of characters to be used as spoken strings, the pronunciation error may not be properly detected if the pronunciation error spans multiple strings. In this regard, in the configuration according to the present embodiment, one or more N-grams are identified as spoken strings from the string representing the converted pronunciation. As a result, since the N-gram is a string of a predetermined number of characters extracted from the string representing the converted pronunciation while shifting the first character by one character and allowing overlaps, it is possible to detect all pronunciation errors in the speaker's speech.
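The difference between cutting the converted string into fixed, non-overlapping pieces and extracting overlapping N-grams can be seen directly on the FIG. 1 sentence. This is a small illustrative sketch; both function names are assumptions.

```python
def fixed_chunks(text: str, n: int) -> list[str]:
    """Cut text into consecutive, non-overlapping pieces of at most n characters."""
    return [text[i:i + n] for i in range(0, len(text), n)]

def ngrams(text: str, n: int) -> list[str]:
    """Every length-n substring, shifting the start one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

converted = "syumihabita-wohikukotodesu."

chunks = fixed_chunks(converted, 5)
print(chunks)                           # ['syumi', 'habit', 'a-woh', 'ikuko', 'todes', 'u.']
print("bita-" in chunks)                # False: the error straddles two chunks
print("bita-" in ngrams(converted, 5))  # True
```

With fixed chunks, "bita-" is split between "habit" and "a-woh", so no single chunk is close to "gita-"; the overlapping windows always contain the mispronounced span intact.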

変換された発音を表す文字列において、検出部３５にて検出された発音誤り文字列が、該発音誤り文字列の検出に用いられた編集距離の算出に用いられた想定文字列に置換され、変換された発音を表す文字列が、発話において用いられた言語の文章に変換される。検出部３５によって、発音誤り文字列が検出された場合、採点部３７によって、訂正部３６において変換された文章が採点され、検出部３５によって、発音誤り文字列が検出されない場合は、採点部３７によって、音声認識部３２において音声認識された結果が採点される。このような構成では、変換された発音文字列において、「単語の誤用というよりも単なる発音の誤りである可能性が高い」とされた文字列が、「発音を誤らなかった場合に発話されたと推定される文字列」である想定文字列に訂正され、発音誤りのみが訂正された文章が採点される。これにより、発話において、話者Ｓが単語を誤用してしまった場合と、話者が単語を正しく用いているにも関わらず、発音を誤ってしまった場合と、を区別して採点することができる。 In the character string representing the converted pronunciation, the mispronunciation character string detected by the detection unit 35 is replaced with the assumed character string used to calculate the edit distance used to detect the mispronunciation character string, and the character string representing the converted pronunciation is converted into a sentence in the language used in the utterance. If the detection unit 35 detects a mispronunciation character string, the scoring unit 37 scores the converted sentence in the correction unit 36, and if the detection unit 35 does not detect a mispronunciation character string, the scoring unit 37 scores the result of the speech recognition in the speech recognition unit 32. In this configuration, in the converted pronunciation character string, a character string that is "more likely to be a mere pronunciation error than a misuse of a word" is corrected to an assumed character string that is "a character string that is estimated to have been uttered if the pronunciation error had not occurred," and the sentence in which only the pronunciation error has been corrected is scored. This makes it possible to distinguish between a case in which the speaker S misused a word in an utterance and a case in which the speaker mispronounced a word despite using the word correctly, and to score the result.

図７に示される例では、従来技術及び本実施形態に係る技術における、発話内容の訂正結果と採点結果が示されている。まず、従来技術では、「ビター」及び「モター」は、共に誤りとして検出され、訂正される。一方で、本実施形態に係る技術では、「ビター」のみが、「ギター」の発音誤りであると推定され、訂正される。したがって、本実施形態に係る技術では、従来技術と比較して、発音誤りが単語の誤用であると誤認されないため、話者Ｓの発話がより正確に採点される。 The example shown in FIG. 7 shows the correction results and scoring results for the speech content under the conventional technology and under the technology according to this embodiment. With the conventional technology, both "bitter" and "motah" are detected as errors and corrected. With the technology according to this embodiment, on the other hand, only "bitter" is estimated to be a mispronunciation of "guitar" and is corrected. Compared with the conventional technology, the technology according to this embodiment therefore scores the speech of the speaker S more accurately, because a mispronunciation is not mistaken for a misuse of a word.

本発明は、上記実施形態に限定されない。具体的には、発話評価システム１が評価する言語は、母音及び子音で構成された言語であればよく、日本語に限定されない。そして、変換部３３における変換後の文字列は、発音を表す文字列であればよく、例えば、発音記号やアルファベットで表されてもよい。 The present invention is not limited to the above embodiment. Specifically, the language evaluated by the speech evaluation system 1 may be any language composed of vowels and consonants, and is not limited to Japanese. The character string converted by the conversion unit 33 may be any character string that represents a pronunciation, and may be expressed, for example, in phonetic symbols or the alphabet.

次に、発話評価システム１に含まれた通信端末１０、及び、発話評価サーバ３０のハードウェア構成について、図８を参照して説明する。上述の通信端末１０、及び、発話評価サーバ３０は、物理的には、プロセッサ１００１、メモリ１００２、ストレージ１００３、通信装置１００４、入力装置１００５、出力装置１００６、バス１００７などを含むコンピュータ装置として構成されてもよい。 Next, the hardware configuration of the communication terminal 10 and the speech evaluation server 30 included in the speech evaluation system 1 will be described with reference to FIG. 8. The communication terminal 10 and the speech evaluation server 30 described above may be physically configured as a computer device including a processor 1001, a memory 1002, a storage 1003, a communication device 1004, an input device 1005, an output device 1006, a bus 1007, etc.

なお、以下の説明では、「装置」という文言は、回路、デバイス、ユニットなどに読み替えることができる。通信端末１０、及び、発話評価サーバ３０のハードウェア構成は、図に示した各装置を１つ又は複数含むように構成されてもよいし、一部の装置を含まずに構成されてもよい。 In the following description, the term "apparatus" may be interpreted as a circuit, device, unit, etc. The hardware configuration of the communication terminal 10 and the speech evaluation server 30 may be configured to include one or more of the devices shown in the figure, or may be configured to exclude some of the devices.

通信端末１０、及び、発話評価サーバ３０における各機能は、プロセッサ１００１、メモリ１００２などのハードウェア上に所定のソフトウェア（プログラム）を読み込ませることで、プロセッサ１００１が演算を行い、通信装置１００４による通信や、メモリ１００２及びストレージ１００３におけるデータの読み出し及び／又は書き込みを制御することで実現される。 The functions of the communication terminal 10 and the speech evaluation server 30 are realized by loading specific software (programs) onto hardware such as the processor 1001 and memory 1002, causing the processor 1001 to perform calculations and control communication by the communication device 1004 and the reading and/or writing of data in the memory 1002 and storage 1003.

プロセッサ１００１は、例えば、オペレーティングシステムを動作させてコンピュータ全体を制御する。プロセッサ１００１は、周辺装置とのインターフェース、制御装置、演算装置、レジスタなどを含む中央処理装置（ＣＰＵ：Central Processing Unit）で構成されてもよい。例えば、発話評価サーバ３０の音声認識部３２等の制御機能はプロセッサ１００１で実現されてもよい。 The processor 1001, for example, operates an operating system to control the entire computer. The processor 1001 may be configured as a central processing unit (CPU) including an interface with peripheral devices, a control device, an arithmetic unit, a register, etc. For example, the control functions of the voice recognition unit 32 of the speech evaluation server 30, etc. may be realized by the processor 1001.

また、プロセッサ１００１は、プログラム（プログラムコード）、ソフトウェアモジュールやデータを、ストレージ１００３及び／又は通信装置１００４からメモリ１００２に読み出し、これらに従って各種の処理を実行する。プログラムとしては、上述の実施の形態で説明した動作の少なくとも一部をコンピュータに実行させるプログラムが用いられる。 The processor 1001 also reads out programs (program codes), software modules, and data from the storage 1003 and/or the communication device 1004 into the memory 1002, and executes various processes according to these. The programs used are those that cause a computer to execute at least some of the operations described in the above embodiments.

例えば、発話評価サーバ３０の音声認識部３２等の制御機能は、メモリ１００２に格納され、プロセッサ１００１で動作する制御プログラムによって実現されてもよく、他の機能ブロックについても同様に実現されてもよい。上述の各種処理は、１つのプロセッサ１００１で実行される旨を説明してきたが、２以上のプロセッサ１００１により同時又は逐次に実行されてもよい。プロセッサ１００１は、１以上のチップで実装されてもよい。なお、プログラムは、電気通信回線を介してネットワークから送信されても良い。 For example, the control functions of the speech recognition unit 32 of the speech evaluation server 30 may be realized by a control program stored in the memory 1002 and running on the processor 1001, and the other functional blocks may be realized in a similar manner. Although the above-mentioned various processes have been described as being executed by one processor 1001, they may be executed simultaneously or sequentially by two or more processors 1001. The processor 1001 may be implemented in one or more chips. The program may be transmitted from a network via a telecommunications line.

メモリ１００２は、コンピュータ読み取り可能な記録媒体であり、例えば、ＲＯＭ（Read Only Memory）、ＥＰＲＯＭ（Erasable Programmable ＲＯＭ）、ＥＥＰＲＯＭ（Electrically Erasable Programmable ＲＯＭ）、ＲＡＭ（Random Access Memory）などの少なくとも１つで構成されてもよい。メモリ１００２は、レジスタ、キャッシュ、メインメモリ（主記憶装置）などと呼ばれてもよい。メモリ１００２は、本発明の一実施の形態に係る発話評価方法を実施するために実行可能なプログラム（プログラムコード）、ソフトウェアモジュールなどを保存することができる。 The memory 1002 is a computer-readable recording medium, and may be composed of at least one of, for example, a ROM (Read Only Memory), an EPROM (Erasable Programmable ROM), an EEPROM (Electrically Erasable Programmable ROM), a RAM (Random Access Memory), etc. The memory 1002 may also be called a register, a cache, a main memory (primary storage device), etc. The memory 1002 can store executable programs (program codes), software modules, etc. for implementing a speech evaluation method according to one embodiment of the present invention.

ストレージ１００３は、コンピュータ読み取り可能な記録媒体であり、例えば、ＣＤＲＯＭ（Compact Disc ＲＯＭ）などの光ディスク、ハードディスクドライブ、フレキシブルディスク、光磁気ディスク(例えば、コンパクトディスク、デジタル多用途ディスク、Ｂｌｕ－ｒａｙ（登録商標）ディスク)、スマートカード、フラッシュメモリ(例えば、カード、スティック、キードライブ)、フロッピー（登録商標）ディスク、磁気ストリップなどの少なくとも１つで構成されてもよい。ストレージ１００３は、補助記憶装置と呼ばれてもよい。上述の記憶媒体は、例えば、メモリ１００２及び／又はストレージ１００３を含むデータベース、サーバその他の適切な媒体であってもよい。 Storage 1003 is a computer-readable recording medium, and may be, for example, at least one of an optical disk such as a CD-ROM (Compact Disc ROM), a hard disk drive, a flexible disk, a magneto-optical disk (e.g., a compact disk, a digital versatile disk, a Blu-ray (registered trademark) disk), a smart card, a flash memory (e.g., a card, a stick, a key drive), a floppy (registered trademark) disk, a magnetic strip, and the like. Storage 1003 may also be referred to as an auxiliary storage device. The above-mentioned storage medium may be, for example, a database, a server, or other suitable medium including memory 1002 and/or storage 1003.

通信装置１００４は、有線及び／又は無線ネットワークを介してコンピュータ間の通信を行うためのハードウェア（送受信デバイス）であり、例えばネットワークデバイス、ネットワークコントローラ、ネットワークカード、通信モジュールなどともいう。 The communication device 1004 is hardware (transmitting/receiving device) for communicating between computers via a wired and/or wireless network, and is also referred to as, for example, a network device, a network controller, a network card, a communication module, etc.

The input device 1005 is an input device (e.g., a keyboard, a mouse, a microphone, a switch, a button, or a sensor) that accepts input from the outside. The output device 1006 is an output device (e.g., a display, a speaker, or an LED lamp) that performs output to the outside. Note that the input device 1005 and the output device 1006 may be integrated into a single unit (e.g., a touch panel).

In addition, the devices such as the processor 1001 and the memory 1002 are connected by a bus 1007 for communicating information. The bus 1007 may be configured as a single bus, or as different buses between the devices.

In addition, the communication terminal 10 and the speech evaluation server 30 may be configured to include hardware such as a microprocessor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a programmable logic device (PLD), or a field programmable gate array (FPGA), and some or all of the functional blocks may be realized by that hardware. For example, the processor 1001 may be implemented by at least one of these pieces of hardware.

Although the present embodiment has been described in detail above, it is clear to those skilled in the art that the present embodiment is not limited to the embodiment described in this specification. The present embodiment can be implemented in modified and altered forms without departing from the spirit and scope of the present invention as defined by the claims. Therefore, the description in this specification is intended for illustrative purposes and has no restrictive meaning with respect to the present embodiment.

For example, the speech evaluation system 1 has been described as including the communication terminal 10 and the speech evaluation server 30, but it is not limited to this configuration, and each function of the speech evaluation system 1 may be realized by the speech evaluation server 30 alone.

Each aspect/embodiment described in this specification may be applied to systems that use LTE (Long Term Evolution), LTE-A (LTE-Advanced), SUPER 3G, IMT-Advanced, 4G, 5G, FRA (Future Radio Access), W-CDMA (registered trademark), GSM (registered trademark), CDMA2000, UMB (Ultra Mobile Broadband), IEEE 802.11 (Wi-Fi), IEEE 802.16 (WiMAX), IEEE 802.20, UWB (Ultra-Wide Band), Bluetooth (registered trademark), or other suitable systems, and/or to next-generation systems extended on the basis of these.

The processing procedures, sequences, flowcharts, and the like of each aspect/embodiment described in this specification may be reordered as long as no inconsistency results. For example, the methods described in this specification present the elements of the various steps in an exemplary order and are not limited to the specific order presented.

Input and output information and the like may be stored in a specific location (e.g., a memory) or may be managed in a management table. Input and output information and the like may be overwritten, updated, or appended. Output information and the like may be deleted. Input information and the like may be transmitted to another device.

The determination may be made based on a value represented by one bit (0 or 1), on a Boolean value (true or false), or on a numerical comparison (e.g., a comparison with a predetermined value).

Each aspect/embodiment described in this specification may be used alone, in combination, or switched as execution proceeds. In addition, notification of predetermined information (e.g., notification that "X is the case") is not limited to explicit notification, and may be performed implicitly (e.g., by not giving notification of the predetermined information).

Software should be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executable files, threads of execution, procedures, functions, and the like, whether it is called software, firmware, middleware, microcode, hardware description language, or any other name.

Software, instructions, and the like may also be transmitted and received via a transmission medium. For example, if software is transmitted from a website, a server, or another remote source using wired technologies such as coaxial cable, fiber optic cable, twisted pair, and digital subscriber line (DSL), and/or wireless technologies such as infrared, radio, and microwave, these wired and/or wireless technologies are included within the definition of a transmission medium.

The information, signals, and the like described in this specification may be represented using any of a variety of different technologies. For example, data, instructions, commands, information, signals, bits, symbols, chips, and the like that may be referred to throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or magnetic particles, optical fields or photons, or any combination thereof.

Note that terms explained in this specification and/or terms necessary for understanding this specification may be replaced with terms having the same or similar meanings.

In addition, the information, parameters, and the like described in this specification may be expressed as absolute values, as values relative to a predetermined value, or as other corresponding information.

A communication terminal may also be referred to by those skilled in the art as a mobile communication terminal, subscriber station, mobile unit, subscriber unit, wireless unit, remote unit, mobile device, wireless device, wireless communication device, remote device, mobile subscriber station, access terminal, mobile terminal, wireless terminal, remote terminal, handset, user agent, mobile client, client, or some other suitable term.

As used in this specification, the phrase "based on" does not mean "based only on" unless expressly stated otherwise. In other words, the phrase "based on" means both "based only on" and "based at least on."

Where designations such as "first" and "second" are used in this specification, any reference to the elements so designated does not generally limit the quantity or order of those elements. These designations may be used in this specification as a convenient way of distinguishing between two or more elements. Thus, a reference to first and second elements does not mean that only two elements may be employed, or that the first element must precede the second element in some way.

To the extent that the terms "include," "including," and variations thereof are used in this specification or the claims, these terms are intended to be inclusive in the same manner as the term "comprising." Furthermore, the term "or" as used in this specification or the claims is not intended to be an exclusive OR.

In this specification, a reference to a device also covers a plurality of such devices, unless the context or technical considerations make it clear that only one such device exists.

Throughout this disclosure, the plural is intended to be included unless the context clearly indicates the singular.

S...speaker, 1...speech evaluation system, 10...communication terminal, 30...speech evaluation server, 31...storage unit, 32...speech recognition unit, 33...conversion unit, 34...calculation unit, 35...detection unit, 36...correction unit, 37...scoring unit, 38...output unit.

Claims

A speech evaluation system comprising:
a conversion unit that acquires a result of speech recognition of a speaker's speech and converts the result of speech recognition into a character string representing a pronunciation;
a calculation unit that calculates an edit distance between an expected character string, which is a character string representing the pronunciation of a word that is expected to appear in the speech of the speaker, and each of one or more spoken character strings included in the character string representing the pronunciation converted by the conversion unit;
a detection unit that detects, from among the one or more spoken character strings, a spoken character string whose edit distance is equal to or smaller than a predetermined value and which is not identical to the expected character string, as a mispronounced character string; and
an output unit that outputs the mispronounced character string,
wherein the calculation unit specifies, as the one or more spoken character strings, one or more N-grams from the character string representing the pronunciation converted by the conversion unit, where N is the number of characters in the expected character string, and
wherein the calculation unit calculates the edit distance in consideration of pronunciation classifications of consonants included in the expected character string and the spoken character string.

The speech evaluation system according to claim 1,
wherein the calculation unit calculates the edit distance in consideration of a pronunciation classification of the consonants that is set according to the native language of the speaker.

The speech evaluation system according to claim 1 or 2, further comprising:
a correction unit that corrects, in the character string representing the converted pronunciation, the mispronounced character string detected by the detection unit to the expected character string used in calculating the edit distance by which the mispronounced character string was detected, and converts the character string representing the converted pronunciation into a sentence in the language used in the utterance,
wherein the output unit outputs the sentence.

The speech evaluation system according to claim 3, further comprising:
a scoring unit that scores the sentence converted by the correction unit when the detection unit detects the mispronounced character string, and scores the result of the speech recognition when the detection unit does not detect the mispronounced character string,
wherein the output unit further outputs the scoring result of the scoring unit.

A speech evaluation system comprising:
a conversion unit that acquires a result of speech recognition of a speaker's speech and converts the result of speech recognition into a character string representing a pronunciation;
a calculation unit that calculates an edit distance between an expected character string, which is a character string representing the pronunciation of a word that is expected to appear in the speech of the speaker, and each of one or more spoken character strings included in the character string representing the pronunciation converted by the conversion unit;
a detection unit that detects, from among the one or more spoken character strings, a spoken character string whose edit distance is equal to or smaller than a predetermined value and which is not identical to the expected character string, as a mispronounced character string;
a correction unit that corrects, in the character string representing the converted pronunciation, the mispronounced character string detected by the detection unit to the expected character string used in calculating the edit distance by which the mispronounced character string was detected, and converts the character string representing the converted pronunciation into a sentence in the language used in the utterance; and
an output unit that outputs the sentence,
wherein the calculation unit calculates the edit distance in consideration of pronunciation classifications of consonants included in the expected character string and the spoken character string.
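
The detection and correction steps recited in the claims can be illustrated with a minimal Python sketch. It is not the implementation of the patented system. The consonant classes in DEFAULT_CLASSES, the substitution costs (0.5 within a class, 1.0 otherwise), the function names, and the sample strings are all assumptions made for illustration; the claims state only that the edit distance takes consonant pronunciation classifications into account, and that the classification may be set according to the speaker's native language (modelled here by passing a different `classes` table). Conversion between sentences and pronunciation strings is outside the scope of the sketch.

```python
from typing import Dict, List, Tuple

# Hypothetical consonant classes; the claims do not list a concrete classification.
DEFAULT_CLASSES: Dict[str, str] = {
    "p": "plosive", "b": "plosive", "t": "plosive",
    "d": "plosive", "k": "plosive", "g": "plosive",
    "s": "fricative", "z": "fricative", "f": "fricative", "v": "fricative",
    "m": "nasal", "n": "nasal",
    "l": "liquid", "r": "liquid",
}


def substitution_cost(a: str, b: str, classes: Dict[str, str]) -> float:
    """Assumed weighting: 0 if equal, 0.5 within one consonant class, else 1."""
    if a == b:
        return 0.0
    class_a = classes.get(a)
    if class_a is not None and class_a == classes.get(b):
        return 0.5
    return 1.0


def edit_distance(expected: str, spoken: str, classes: Dict[str, str]) -> float:
    """Levenshtein distance with class-aware substitution costs."""
    rows, cols = len(expected) + 1, len(spoken) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = float(i)  # delete i expected characters
    for j in range(1, cols):
        dist[0][j] = float(j)  # insert j spoken characters
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,  # deletion
                dist[i][j - 1] + 1.0,  # insertion
                dist[i - 1][j - 1]
                + substitution_cost(expected[i - 1], spoken[j - 1], classes),
            )
    return dist[-1][-1]


def ngrams(text: str, n: int) -> List[str]:
    """All contiguous substrings of length n, in order of their offset."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def detect_mispronunciations(
    pronunciation: str,
    expected: str,
    threshold: float,
    classes: Dict[str, str] = DEFAULT_CLASSES,
) -> List[Tuple[int, str, float]]:
    """Return (offset, spoken string, distance) for each N-gram that is
    within the threshold of the expected string but not identical to it."""
    hits = []
    for offset, spoken in enumerate(ngrams(pronunciation, len(expected))):
        distance = edit_distance(expected, spoken, classes)
        if distance <= threshold and spoken != expected:
            hits.append((offset, spoken, distance))
    return hits


def correct_pronunciation(
    pronunciation: str,
    expected: str,
    threshold: float,
    classes: Dict[str, str] = DEFAULT_CLASSES,
) -> str:
    """Replace each detected N-gram with the expected string.
    Offsets stay valid because both strings have the same length."""
    corrected = pronunciation
    for offset, spoken, _ in detect_mispronunciations(
        pronunciation, expected, threshold, classes
    ):
        corrected = corrected[:offset] + expected + corrected[offset + len(spoken):]
    return corrected


print(detect_mispronunciations("ilikelice", "rice", threshold=0.5))
print(correct_pronunciation("ilikelice", "rice", threshold=0.5))
```

With these assumed costs, the only 4-gram of "ilikelice" within the threshold of "rice" is "lice" at offset 5 (distance 0.5, because "l" and "r" share a class), so the sketch reports that string and rewrites the pronunciation to "ilikerice". A correctly pronounced "rice" is never reported, because identical strings are excluded.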