JP2011257529A

JP2011257529A - Method, device and program for holding-related utterance extraction

Info

Publication number: JP2011257529A
Application number: JP2010130824A
Authority: JP
Inventors: Takaaki Fukutomi; 隆朗福冨; Tsubasa Shinozaki; 翼篠崎; Osamu Yoshioka; 理吉岡; Satoshi Takahashi; 敏高橋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-06-08
Filing date: 2010-06-08
Publication date: 2011-12-22

Abstract

PROBLEM TO BE SOLVED: To provide a technology capable of extracting an utterance related to holding more appropriately. SOLUTION: A voice feature amount calculation unit 2 extracts a voice feature amount of a voice signal. A voice recognition unit 3 uses the voice feature amount, an acoustic model and a language model to execute voice recognition for the voice signal, to detect an utterance included in the voice signal and to generate information on the detected utterance. A holding interval detection unit 4 uses the information on the utterance to detect a holding interval between adjacent utterances which is equal to or longer than a predetermined time. An extraction unit 5 extracts more utterances from a group of utterances adjacent to the holding interval when the holding interval is longer.

Description

The present invention relates to a technique for extracting utterances related to placing a call on hold.

One known method converts a telephone call into text using speech recognition technology and then extracts the important words in the call using text processing technology (see, for example, Non-Patent Document 1).

Conventionally, the cause of a hold was investigated by applying the method described in Non-Patent Document 1 to the entire call in which the hold occurred and extracting important words from it.

Non-Patent Document 1: Takenobu Tokunaga (author), Junichi Tsujii (editor), "Language and Computation (5): Information Retrieval and Language Processing", University of Tokyo Press, November 1999.

However, there is a problem that an important word extracted from the entire call in which a hold occurred may be a word related to a topic that has nothing to do with the hold.

An object of the present invention is to extract hold-related utterances more appropriately.

To solve the above problem, a voice feature amount of a voice signal is extracted. Voice recognition is performed on the voice signal using the voice feature amount, an acoustic model, and a language model; the utterances included in the voice signal are detected; and information about the detected utterances is generated. Using the information about the utterances, a section in which the interval between adjacent utterances is equal to or longer than a predetermined time is detected as a hold section. From the set of utterances adjacent to the hold section, a larger number of utterances is extracted the longer the hold section is.

Hold-related utterances can be extracted more appropriately.

FIG. 1 is a functional block diagram of an example of the hold-related utterance extraction device.
FIG. 2 is a flowchart showing an example of the hold-related utterance extraction method.
FIG. 3 is a flowchart showing an example of step S4.
FIG. 4 is a diagram for explaining an example of hold section detection.
FIG. 5 is a diagram for explaining an example of hold-related utterance extraction.

An embodiment of the present invention will be described below with reference to the drawings.

As shown in FIG. 1, the hold-related utterance extraction device includes, for example, a voice signal acquisition unit 1, a voice feature amount calculation unit 2, a voice recognition unit 3, a hold section detection unit 4, and an extraction unit 5. This hold-related utterance extraction device executes each step of the hold-related utterance extraction method illustrated in FIG. 2.

The voice acquisition unit 1 performs A/D conversion on the input analog voice signal to generate a digital voice signal (step S1). The digital voice signal is sent to the voice feature extraction unit 2. The analog voice signals input to the voice acquisition unit 1 are a plurality of analog voice signals, each corresponding to one of a plurality of channels. In this example, the number of channels is two: channel A carries the operator's voice and channel B carries the customer's voice.

The voice feature extraction unit 2 extracts the voice feature amount of the digital voice signal (step S2). Information about the extracted voice feature amount is sent to the voice recognition unit 3. The voice feature amount is, for example, MFCC (Mel-Frequency Cepstrum Coefficient) or ΔMFCC, the amount of change in MFCC; any feature that can be used by the voice recognition unit 3 described later is acceptable. Existing techniques may be used to extract the voice feature amount.

The voice recognition unit 3 performs voice recognition on the voice signal using the voice feature amount, an acoustic model, and a language model, detects the utterances included in the voice signal, and generates information about the detected utterances (step S3). The information about the detected utterances is sent to the hold section detection unit 4 and the extraction unit 5. Existing techniques may be used for voice recognition. Because it is sufficient to recognize the incoming-call phrases and call-ending phrases described later, a voice recognition technique with relatively light processing may be used.

The information about the utterances is, for example, information on the following: the start time Sci and end time Eci of each customer utterance Uci (i = 1, 2, ...); the start time Soi and end time Eoi of each operator utterance Uoi (i = 1, 2, ...); the notations Wci1, Wci2, ..., WciMci of the Mci words constituting each customer utterance Uci and the part-of-speech information Pci1, Pci2, ..., PciMci of those words; and the notations Woi1, Woi2, ..., WoiMoi of the Moi words constituting each operator utterance Uoi and the part-of-speech information Poi1, Poi2, ..., PoiMoi of those words.
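The utterance information listed above could be held in a structure like the following. This is only an illustrative sketch: the class names `Word` and `Utterance`, the field names, the helper `call_order`, and the use of seconds as the time unit are assumptions and do not appear in the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    surface: str  # notation of the word (W in the text)
    pos: str      # part-of-speech information (P in the text)


@dataclass
class Utterance:
    speaker: str  # "operator" (channel A) or "customer" (channel B)
    start: float  # start time S, assumed to be in seconds
    end: float    # end time E, assumed to be in seconds
    words: List[Word] = field(default_factory=list)


def call_order(utterances: List[Utterance]) -> List[Utterance]:
    """Merge both channels into one sequence ordered by start time."""
    return sorted(utterances, key=lambda u: u.start)


# Hypothetical three-utterance call, deliberately given out of order.
call = [
    Utterance("customer", 3.0, 6.0),
    Utterance("operator", 0.0, 2.5, [Word("hello", "interjection")]),
    Utterance("operator", 6.5, 8.0),
]
ordered = call_order(call)
```

Ordering both channels by start time gives each utterance a position counted from the start of the call, which is the kind of index used later to identify a hold section.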

Using the information about the utterances, the hold section detection unit 4 detects a hold section, that is, a section in which the interval between adjacent utterances is equal to or longer than a predetermined time and at least one of the adjacent utterances contains a typical phrase used when placing a call on hold (step S4). Information about the detected hold section is sent to the extraction unit 5.

The hold section detection unit 4 includes a silent section extraction unit 41 and a fixed expression extraction unit 42. First, the silent section extraction unit 41 uses the information about the utterances to detect a silent section in which the interval between adjacent utterances is equal to or longer than a predetermined time. The fixed expression extraction unit 42 then determines whether at least one of the utterances adjacent to the detected silent section contains a typical phrase used when placing a call on hold. If it does, the silent section extraction unit 41 treats this silent section as a hold section.

Whether an utterance contains a phrase is determined by whether the utterance contains M or more of the words constituting the phrase. The threshold M is obtained as M = ┌N × k┐, where N is the total number of words constituting the phrase, k is an arbitrary constant from 0 to 1 inclusive, and ┌·┐ denotes the smallest integer greater than or equal to its argument. To extract hold sections more precisely, k should be set to a large value; to reduce missed detections, k should be set to a small value. Judging by the proportion of the phrase's words that an utterance contains allows more flexible detection than requiring all words to match. Whether a word is contained in an utterance is determined, for example, by whether the utterance contains a word with the same notation and the same part-of-speech information as that word. Alternatively, the part-of-speech information may be ignored and the determination made by whether the utterance contains a word with the same notation.
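The threshold rule above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the representation of a word as a `(notation, part_of_speech)` tuple, and the sample English phrase are all hypothetical.

```python
import math
from typing import List, Tuple

Word = Tuple[str, str]  # (notation, part-of-speech); representation is an assumption


def phrase_threshold(n_words: int, k: float) -> int:
    """Return M = ceil(N * k) for a phrase of N words and 0 <= k <= 1."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1 inclusive")
    return math.ceil(n_words * k)


def contains_phrase(utterance: List[Word], phrase: List[Word],
                    k: float, use_pos: bool = True) -> bool:
    """True if at least M of the phrase's words occur in the utterance."""
    if use_pos:
        # Match on notation and part-of-speech together.
        present = set(utterance)
        hits = sum(1 for w in phrase if w in present)
    else:
        # Ignore part-of-speech and match on notation only.
        surfaces = {surface for surface, _ in utterance}
        hits = sum(1 for surface, _ in phrase if surface in surfaces)
    return hits >= phrase_threshold(len(phrase), k)


# Hypothetical four-word hold phrase and two hypothetical utterances.
phrase = [("please", "adv"), ("hold", "verb"), ("one", "num"), ("moment", "noun")]
partial = [("hold", "verb"), ("on", "prep"), ("one", "num"), ("second", "noun")]
unrelated = [("thank", "verb"), ("you", "pron"), ("one", "pron")]

match_partial = contains_phrase(partial, phrase, k=0.5)
match_unrelated = contains_phrase(unrelated, phrase, k=0.5)
match_surface_only = contains_phrase(unrelated, phrase, k=0.25, use_pos=False)
```

With N = 4 and k = 0.5 the threshold is M = 2, so the utterance that shares two words with the phrase matches and the unrelated one does not. Note that k = 0 gives M = 0, in which case every utterance matches.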

Typical phrases used when placing a call on hold include, for example, 「少々お待ち下さい」 ("Please wait a moment") and 「お待たせしました」 ("Thank you for waiting"). 「少々お待ち下さい」 is composed of multiple words, such as "少々: 連用詞 (adverbial)", "お: 冠動詞 (verbal prefix)", "待: 動詞 (verb)", ..., where the notation and part-of-speech information of each word are written as "notation: part-of-speech information". At least one of the notation and the part-of-speech information is used to determine whether a word is contained in an utterance. The notation and part-of-speech information can be obtained with existing morphological analysis techniques.

More specifically, the hold section detection unit 4 performs the processing of FIG. 3, which is described below. In this example, the processing of FIG. 3 extracts hold sections by considering only the operator's utterances Uoi (i = 1, 2, ..., No). This takes into account that some customers do not give backchannel responses and that the hold melody is played through the customer's telephone.

The silent section extraction unit 41 initializes i and h by setting i = 2 and h = 1 (step S41).

The silent section extraction unit 41 determines whether i > No (step S42). No is the total number of operator utterances.

If i > No does not hold, the silent section extraction unit 41 determines whether the interval Eoi−So(i−1) between the operator's i-th utterance Uoi and the operator's (i−1)-th utterance Uo(i−1) is greater than a predetermined time Th, that is, whether Eoi−So(i−1) > Th (step S43).

If Eoi−So(i−1) > Th, the fixed expression extraction unit 42 determines whether at least one of the i-th utterance Uoi and the (i−1)-th utterance Uo(i−1) contains a typical phrase used when placing a call on hold (step S44).

If such a phrase is contained, the silent section extraction unit 41 takes the section between the (i−1)-th utterance Uo(i−1) and the i-th utterance Uoi as the h-th hold section (step S45). A hold section is identified, for example, by its position in the sequence of utterances counted from the start of the call, with operator and customer utterances counted together. If the (i−1)-th utterance Uo(i−1) is the Hsh-th utterance from the start of the call and the i-th utterance Uoi is the Heh-th utterance from the start of the call, the h-th hold section is identified as Hsh to Heh.

After step S45, the silent section extraction unit 41 increments h by 1, setting h = h + 1 (step S46).

After step S46, or when step S43 determines that Eoi−So(i−1) > Th does not hold, or when step S44 determines that no typical phrase is contained, the silent section extraction unit 41 increments i by 1, setting i = i + 1 (step S47). After step S47, the processing returns to step S42.

If step S42 determines that i > No, the silent section extraction unit 41 determines whether h is 1 (step S48), that is, whether any hold section was detected. If h is 1, no hold section was detected.

If h is not 1, the processing of step S4 ends and the processing proceeds to step S5. If h is 1, the subsequent processing of step S5 is not performed.
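The loop of steps S41 to S48 can be sketched as follows. This is an illustrative simplification under several assumptions: utterances are `(start, end)` tuples in seconds, the gap is measured from the end of the previous operator utterance to the start of the current one, the phrase check is passed in as a hypothetical callback `has_hold_phrase`, and hold sections are reported as pairs of operator-utterance indices rather than as positions Hs and He in the whole call.

```python
from typing import Callable, List, Sequence, Tuple

Span = Tuple[float, float]  # (start time, end time) of one operator utterance


def detect_hold_sections(
    operator: Sequence[Span],
    th: float,
    has_hold_phrase: Callable[[int], bool] = lambda idx: True,
) -> List[Tuple[int, int]]:
    """Return (previous index, current index) pairs that bound a hold section.

    Indices are 0-based positions in `operator`. An empty result
    corresponds to the case h == 1 in the flowchart (nothing detected).
    """
    holds: List[Tuple[int, int]] = []
    # The flowchart starts at i = 2 (1-based), i.e. index 1 here (S41, S42).
    for i in range(1, len(operator)):
        # Assumed gap definition: silence between the two utterances (S43).
        gap = operator[i][0] - operator[i - 1][1]
        if gap <= th:
            continue
        # At least one neighbour must contain a typical hold phrase (S44).
        if has_hold_phrase(i - 1) or has_hold_phrase(i):
            holds.append((i - 1, i))  # S45, S46
    return holds


# Hypothetical operator channel with one long silence between index 1 and 2.
operator_spans = [(0.0, 2.0), (3.0, 5.0), (40.0, 42.0), (43.0, 45.0)]
found = detect_hold_sections(operator_spans, th=10.0)
none_found = detect_hold_sections(operator_spans, th=10.0,
                                  has_hold_phrase=lambda idx: False)
```

Leaving `has_hold_phrase` at its default accepts every long silence, which corresponds to detection based on the silence threshold alone.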

In the example of FIG. 4, the interval between utterance Uo2 and utterance Uo3 is Th or longer. It is therefore determined whether utterances Uo2 and Uo3, which are adjacent to this interval, contain a typical phrase used when placing a call on hold. If such a phrase is contained, this interval is taken as hold section 1; Hs1 is set to 3, the position of utterance Uo2 counted from the start of the call, and He1 is set to 5, the position of utterance Uo3 counted from the start of the call.

The extraction unit 5 extracts, from the set of utterances adjacent to the hold section, a larger number of utterances the longer the hold section is (step S5). The extracted utterances are analyzed as hold-related utterances.

In the example of FIG. 5, a total of nine utterances are extracted from the set of utterances adjacent to the hold section when the hold section is long, and a total of five utterances are extracted when the hold section is short.

When the hold section is long, the cause of the hold is assumed to be complicated, and the scope of analysis is set wide. Conversely, when the hold section is short, the cause of the hold is assumed to be simple, and the scope of analysis is set narrow. Expanding or contracting the scope of analysis according to the length of the hold section in this way makes it possible to extract hold-related utterances more appropriately.

For example, ┌K × t┐ utterances are extracted from the set of utterances adjacent to the hold section, where K is a predetermined constant, t is the length of the hold section, and ┌·┐ denotes the smallest integer greater than or equal to its argument. For example, K = (arbitrarily chosen number of utterances) / (average hold time). The arbitrarily chosen number of utterances is the number of utterances to be analyzed when a hold of about the average hold time occurs, for example 5.
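The count-based rule can be illustrated with a short sketch. The function names, the even split of the count between the utterances before and after the hold, and the numeric values (an average hold time of 20 seconds) are assumptions made for illustration; the text only fixes the formula ┌K × t┐ and the example base count of 5.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def utterance_count(hold_len: float, base_count: int, avg_hold: float) -> int:
    """Number of utterances to extract: ceil(K * t), K = base_count / avg_hold."""
    k = base_count / avg_hold
    return math.ceil(k * hold_len)


def select_adjacent(call: Sequence[T], hs: int, he: int, n: int) -> List[T]:
    """Pick n utterances around a hold bounded by 0-based positions hs and he.

    Assumed split: n // 2 utterances ending at position hs, and the
    remainder starting at position he.
    """
    n_before = n // 2
    n_after = n - n_before
    before = list(call[max(0, hs - n_before + 1): hs + 1])
    after = list(call[he: he + n_after])
    return before + after


# Hypothetical values: 5 utterances for an average 20-second hold (K = 0.25).
n_average = utterance_count(hold_len=20.0, base_count=5, avg_hold=20.0)
n_long = utterance_count(hold_len=36.0, base_count=5, avg_hold=20.0)
n_short = utterance_count(hold_len=10.0, base_count=5, avg_hold=20.0)

# Hypothetical call of 12 utterances labelled 0..11, hold between 5 and 7.
picked = select_adjacent(list(range(12)), hs=5, he=7, n=n_average)
```

A hold of average length yields the base count, and longer or shorter holds scale the count proportionally, rounded up.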

Alternatively, with K' a predetermined constant and t the length of the hold section, the utterances that belong to the set of utterances adjacent to the hold section and that fall within a time span of at most K' × t adjacent to the hold section may be extracted. For example, K' = (arbitrarily chosen time length) / (average hold time).
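The time-window variant can be sketched in the same way. This is a hedged illustration: the function name, the `(start, end)` tuple representation, the choice to apply the window on both sides of the hold, and the requirement that an utterance lie entirely inside the window are all assumptions.

```python
from typing import List, Sequence, Tuple

Span = Tuple[float, float]  # (start time, end time) of one utterance


def extract_by_time(utterances: Sequence[Span], hold_start: float,
                    hold_end: float, k_prime: float) -> List[Span]:
    """Keep utterances lying within K' * t before or after the hold."""
    t = hold_end - hold_start   # length of the hold section
    window = k_prime * t        # time span adjacent to the hold
    selected = []
    for start, end in utterances:
        in_before = hold_start - window <= start and end <= hold_start
        in_after = hold_end <= start and end <= hold_end + window
        if in_before or in_after:
            selected.append((start, end))
    return selected


# Hypothetical call: hold from 100 s to 140 s (t = 40), K' = 0.5, window = 20 s.
spans = [(70.0, 78.0), (82.0, 90.0), (92.0, 99.0), (141.0, 150.0), (155.0, 165.0)]
near_hold = extract_by_time(spans, hold_start=100.0, hold_end=140.0, k_prime=0.5)
```

Unlike the count-based rule, this variant extracts a variable number of utterances, depending on how densely the speakers talk around the hold.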

The set of utterances adjacent to the hold section is one of the following: the set of utterances immediately before the hold section, the set of utterances immediately after the hold section, or the set of utterances immediately before and immediately after the hold section. When it is the set of utterances immediately before and immediately after the hold section, the number of hold-related utterances extracted from immediately before the hold section and the number extracted from immediately after it may be the same or different.

The hold section detection unit 4 may skip the processing of the fixed expression extraction unit 42 and simply treat any section in which the interval between adjacent utterances is equal to or longer than a predetermined time as a hold section.

The hold-related utterance extraction device and method can be realized by a computer. In this case, the processing performed by each unit of the device is described by a program. By executing this program on a computer, each unit of the device and each step of the method are realized on the computer.

The program describing this processing can be recorded on a computer-readable recording medium. In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing may instead be realized in hardware.

The present invention is not limited to the embodiment described above and may be modified as appropriate without departing from the spirit of the invention.

DESCRIPTION OF SYMBOLS
1 Voice acquisition unit
2 Voice feature extraction unit
3 Voice recognition unit
4 Hold section detection unit
41 Silent section extraction unit
42 Fixed expression extraction unit
5 Extraction unit

Claims

An audio feature extraction step for extracting an audio feature of the audio signal;
A speech recognition step of performing speech recognition on the speech signal using the speech feature, an acoustic model, and a language model, detecting speech included in the speech signal, and generating information about the detected speech;
Using the information about the utterance, a holding section detecting step for detecting a holding section in which the interval between adjacent utterances is a predetermined time or more;
An extraction step of extracting a larger number of utterances as the holding section is longer from a set of utterances adjacent to the holding section;
Hold related utterance extraction method.

In the holding related utterance extraction method according to claim 1,
The holding section detecting step uses information about the utterance, and an interval between adjacent utterances is a predetermined time or more, and at least one of the adjacent utterances includes a typical phrase used at the time of holding. A step of detecting a holding interval;
An on-hold related utterance extraction method.

In the holding related utterance extraction method according to claim 1 or 2,
The extracted utterances are ┌K × t┐ utterances, where K is a predetermined constant, t is the length of the holding section, and ┌·┐ denotes the smallest integer greater than or equal to its argument.
An on-hold related utterance extraction method.

In the holding related utterance extraction method according to claim 1 or 2,
The extracted utterances are utterances included in a time span of at most K' × t adjacent to the holding section, where K' is a predetermined constant and t is the length of the holding section.
An on-hold related utterance extraction method.

In the holding related utterance extraction method according to any one of claims 1 to 4,
The set of utterances adjacent to the holding section is a set of utterances immediately before the holding section.
An on-hold related utterance extraction method.

In the holding related utterance extraction method according to any one of claims 1 to 4,
The set of utterances adjacent to the holding section is a set of utterances immediately after the holding section.
An on-hold related utterance extraction method.

A voice feature amount extraction unit that extracts a voice feature amount of a voice signal;
A speech recognition unit that performs speech recognition on the speech signal using the speech feature, an acoustic model, and a language model, detects an utterance included in the speech signal, and generates information about the detected utterance;
Using the information about the utterance, a holding section detecting unit that detects a holding section in which the interval between adjacent utterances is equal to or longer than a predetermined time; and
An extraction unit that extracts a larger number of utterances as the holding section is longer from a set of utterances adjacent to the holding section;
Hold-related utterance extraction device.

In the hold related utterance extraction device according to claim 7,
The holding section detection unit uses the information about the utterance, the interval between adjacent utterances is equal to or longer than a predetermined time, and at least one of the adjacent utterances includes a typical phrase used at the time of holding. Detect pending intervals,
A hold-related utterance extraction device characterized by the above.

A holding related utterance extraction program for causing a computer to execute each step of the holding related utterance extracting method according to claim 1.