JP6784255B2

JP6784255B2 - Voice processing device, voice processing system, voice processing method, and program

Info

Publication number: JP6784255B2
Application number: JP2017507495A
Authority: JP
Inventors: 孝文越仲; 隆之鈴木
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2015-03-25
Filing date: 2016-03-18
Publication date: 2020-11-11
Anticipated expiration: 2036-03-18
Also published as: JPWO2016152132A1; WO2016152132A1

Description

The present invention relates to a voice processing device, a voice processing system, a voice processing method, and a program for extracting frequent patterns from voice data.

In recent years, scientific methods based on forensic science have been widely used in police criminal investigations. In fingerprint analysis, a typical example, fingerprint images collected at a crime scene are compared one by one with a large number of known fingerprint images to estimate who was involved in the crime. An analogous technique that handles voice is called voiceprint analysis or voice appraisal.

Patent Document 1 describes a technique for extracting, from voice data, the voice data of unknown words that are candidates for keywords to be registered in a voice recognition dictionary. The technique detects, as an utterance section, a section in which the voice power value of the voice data stays above a threshold th1 continuously for at least a certain time, and divides each utterance section into sections in which the power value stays above a threshold th2 continuously for at least a certain time. The technique then acquires phoneme sequences from the divided voice data, performs clustering, calculates evaluation values to detect unknown words, and registers them in the dictionary.

Patent Document 2 describes a technique for determining the cause of a misrecognition and notifying the user. The technique divides a vector sequence of mel-frequency cepstrum coefficients (hereinafter "MFCC") extracted by a feature extraction unit into per-phoneme segments using a set of standard models. It then investigates the cause of the misrecognition, creates a message string to present to the user according to the analysis result, and notifies the user by showing the message on a display.

Patent Document 1: International Publication No. 2009/136440
Patent Document 2: Japanese Unexamined Patent Publication No. 2004-325635

Non-Patent Document 1: M. Kotti, V. Moschou, and C. Kotropoulos, "Speaker segmentation and clustering," Signal Processing, Vol. 88, No. 5, pp. 1091-1124, May 2008.
Non-Patent Document 2: T. Anastasakos, J. McDonough, J. Makhoul, "Speaker adaptive training: a maximum likelihood approach to speaker normalization," Signal Processing, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, pp. 1043-1046, Apr. 1997.
Non-Patent Document 3: M. J. F. Gales, "Cluster adaptive training of hidden Markov models," IEEE Trans. on Speech and Audio Processing, Vol. 8, No. 4, pp. 417-428, Jul. 2000.

However, although the technique described in Patent Document 1 can select unknown words that are keyword candidates, it cannot select a phrase that contains a sentence (for example, a sentence such as "Prepare the ransom."). The technique described in Patent Document 2 can analyze the vector sequence of each misrecognized segment, but it cannot select a desired phrase. In other words, the techniques of Patent Documents 1 and 2 share the problem that a desired phrase cannot be selected.

An object of the present invention is to solve the above problem and to provide a voice processing device, a voice processing system, a voice processing method, and a program capable of selecting a desired phrase.

A voice processing device according to one aspect of the present invention includes: first generation means for generating, from voice data, a plurality of segments such that adjacent segments at least partially overlap; second generation means for classifying the plurality of segments based on phonological similarity to generate clusters; selection means for selecting, based on the sizes of the clusters, a cluster that satisfies a predetermined condition; and extraction means for extracting the segments included in the selected cluster.

A voice processing method according to one aspect of the present invention generates, from voice data, a plurality of segments in which adjacent segments at least partially overlap; classifies the plurality of segments based on phonological similarity to generate clusters; selects, based on the sizes of the clusters, a cluster that satisfies a predetermined condition; and extracts the segments included in the selected cluster.

A program according to one aspect of the present invention causes a computer to execute: a process of generating, from voice data, a plurality of segments in which adjacent segments at least partially overlap; a process of classifying the plurality of segments based on phonological similarity to generate clusters; a process of selecting, based on the sizes of the clusters, a cluster that satisfies a predetermined condition; and a process of extracting the segments included in the selected cluster.

The present invention has the effect that a desired phrase can be selected in a voice processing device, a voice processing system, a voice processing method, and a program.

A block diagram showing a configuration example of the voice processing device according to the first embodiment of the present invention.
A schematic block diagram showing a configuration example of the computer in each embodiment and specific example of the present invention.
A flowchart showing an operation example of the voice processing device according to the first embodiment of the present invention.
A diagram showing an example of a method by which the voice processing device according to the first embodiment of the present invention extracts phrases using an HMM.
A block diagram showing a configuration example of the voice processing device according to the second embodiment of the present invention.
A flowchart showing an operation example of the voice processing device according to the second embodiment of the present invention.
A block diagram showing a configuration example of the voice processing device according to the third embodiment of the present invention.
A flowchart showing an operation example of the voice processing device according to the third embodiment of the present invention.
A block diagram showing a configuration example of the voice processing system according to the fourth embodiment of the present invention.
A diagram showing an example of the voice data stored in the external storage device in a specific example of the present invention.
A diagram showing an example of a method by which the generation unit divides voice data in a specific example of the present invention.
A diagram showing an example of a method by which the clustering unit generates clusters that group a plurality of segments in a specific example of the present invention.

First, the background of the present invention is described to make the embodiments easier to understand.

In a voice appraisal method related to the present invention, a telephone call such as a ransom demand from a kidnapper or a crime announcement from a terrorist is recorded, and the recorded voice is compared with known voices in an attempt to identify the speaker on the call.

Unlike fingerprints, which remain unchanged for life, a voice changes each time depending on what is said. A voice appraisal method therefore cuts out and compares portions (sections) of speech in which the same content is spoken. For example, the phrase "Prepare the money" can be expected to appear often in a kidnapper's ransom demand, so such a phrase is found, cut out, and compared with other speech in which "Prepare the money" is likewise spoken.

Which phrase to use depends on the case. For a kidnapper, the above "Prepare the money" is considered suitable because it appears frequently. For a bank transfer fraudster, a phrase about money would also be suitable, though it would probably differ from the kidnapper case. For a terrorist there would be another, better phrase, and different phrases would also be preferable in intelligence work by the military and other government agencies. Until now, the selection of such frequently occurring phrases has relied on the experience and intuition of skilled appraisers. In that case, however, a skilled appraiser must spend time listening to the speech carefully, so obtaining the desired phrases needed for voice appraisal incurs a large human cost, among other problems.

The embodiments of the present invention described below solve the above problems and make it possible to select a desired phrase.

Embodiments and specific examples of the present invention are described below with reference to the drawings. In each embodiment and specific example, like components are given the same reference numerals, and their description is omitted as appropriate.

[First Embodiment]
A first mode for carrying out the present invention (hereinafter, "first embodiment") is described in detail below with reference to the drawings.

[Description of configuration]
FIG. 1 is a block diagram showing a configuration example of the voice processing device 10 according to the first embodiment of the present invention. Referring to FIG. 1, the voice processing device 10 includes a generation unit 11, a clustering unit 12, a selection unit 13, and an extraction unit 14. The generation unit 11 is also referred to as the first generation unit, and the clustering unit 12 as the second generation unit.

The generation unit 11 generates, from voice data stored in an external storage device, a plurality of segments in which adjacent segments at least partially overlap. For example, the generation unit 11 subdivides the stored voice data into short time units and generates the plurality of segments from the subdivided voice data. The segments generated by the generation unit 11 may all have the same fixed time length. Alternatively, the generation unit 11 may divide a single piece of voice data several times using different time lengths and use the divided voice data to generate segments of various time lengths.

The clustering unit 12 classifies the plurality of segments based on a predetermined similarity index to generate clusters.

The selection unit 13 selects at least one cluster from the generated clusters based on the size of each cluster. The extraction unit 14 extracts the segments included in the selected cluster. Here, the size of a cluster is, for example, the total time length of its segments, or the product of the number of occurrences of the segment content (also called a phrase) and the segment length.

FIG. 2 is a schematic block diagram showing a configuration example of the computer 1000 in each embodiment and specific example of the present invention. The computer 1000 includes a CPU (Central Processing Unit) 1001, a main storage device 1002, an auxiliary storage device 1003, an interface 1004, an input device 1005, and a display device 1006.

The voice processing device 10 and the like of each embodiment and specific example are implemented on the computer 1000. The operations of the voice processing device 10 and the like are stored in the auxiliary storage device 1003 in the form of a program. The CPU 1001 reads the program from the auxiliary storage device 1003, loads it into the main storage device 1002, and executes the above processing according to the program. For example, by reading the program from the auxiliary storage device 1003 and loading it into the main storage device 1002, the CPU 1001 realizes the functions of the generation unit 11, the clustering unit 12, the selection unit 13, and the extraction unit 14.

The auxiliary storage device 1003 is an example of a non-transitory tangible medium. Other examples of non-transitory tangible media include a magnetic disk, a magneto-optical disk, a CD-ROM (Compact Disc Read Only Memory), a DVD-ROM (Digital Versatile Disc Read Only Memory), and a semiconductor memory connected via the interface 1004. When the program is delivered to the computer 1000 over a communication line, the computer 1000 that receives it may load the program into the main storage device 1002 and execute the above processing.

The interface 1004 is connected to the CPU 1001 and to a network or an external storage medium. External data may be taken into the CPU 1001 via the interface 1004. The input device 1005 is, for example, a keyboard, a mouse, a touch panel, or a microphone. The display device 1006 is a device, such as an LCD (Liquid Crystal Display) or a CRT (Cathode Ray Tube) display, that displays a screen corresponding to drawing data processed by the CPU 1001, a GPU (Graphics Processing Unit) (not shown), or the like. The hardware configuration shown in FIG. 2 is only an example, and each part shown in FIG. 2 may instead be implemented as an independent logic circuit.

The program may realize only a part of the processing described above. The program may also be a differential program that realizes the processing in combination with another program already stored in the auxiliary storage device 1003.

[Description of operation]
The operation of this embodiment is described with reference to FIG. 3. FIG. 3 is a flowchart showing an operation example of the voice processing device 10 according to the first embodiment of the present invention.

The generation unit 11 generates a plurality of segments from the voice data stored in the external storage device (step S101). In doing so, the generation unit 11 generates the segments so that adjacent segments at least overlap in time. The time length of a segment may be a fixed value chosen according to the expected phrase length, for example in the range from one second to several seconds.
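Step S101 with fixed-length segments can be illustrated by a sliding window whose hop is shorter than the window, so that adjacent segments overlap in time. The following is a minimal sketch under stated assumptions, not the patent's implementation: the function name `make_overlapping_segments`, the 2-second window, the 0.5-second hop, and the 8 kHz sample rate are all hypothetical choices made for illustration.

```python
def make_overlapping_segments(num_samples, sample_rate, seg_sec=2.0, hop_sec=0.5):
    """Return (start, end) sample indices of fixed-length, overlapping segments.

    seg_sec and hop_sec are illustrative values; the patent only says the
    segment length may be a fixed value of about one to several seconds.
    """
    seg_len = int(seg_sec * sample_rate)
    hop_len = int(hop_sec * sample_rate)
    # A hop shorter than the segment is what makes adjacent segments overlap.
    if hop_len < 1 or hop_len >= seg_len:
        raise ValueError("hop must be at least one sample and shorter than the segment")
    segments = []
    start = 0
    while start + seg_len <= num_samples:
        segments.append((start, start + seg_len))
        start += hop_len
    return segments


# 10 seconds of audio sampled at 8 kHz (hypothetical input size).
segs = make_overlapping_segments(num_samples=80000, sample_rate=8000)
print(len(segs), segs[:3])
```

Calling the function several times with different `seg_sec` values would give segments of various time lengths, which corresponds to the multi-length variant described in the text.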

The generation unit 11 may also divide a single piece of voice data several times using different time lengths to generate segments of various time lengths. Alternatively, the generation unit 11 may divide the voice data at predetermined change points, silent sections, and the like, using a method such as that described in Non-Patent Document 1, and generate variable-length segments from the divided voice data.

The clustering unit 12 classifies the plurality of segments based on a predetermined similarity index to generate clusters (step S102). That is, the clustering unit 12 clusters the segments: it calculates the similarity between the segments generated by the generation unit 11 and generates clusters that group highly similar segments together. For the similarity index and the specific cluster generation method, the method described in Non-Patent Document 1 can be used, for example.
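As an illustration of step S102, the sketch below groups segments whose pairwise distance to a cluster's first member (its "leader") falls under a threshold. This simple leader clustering is a stand-in chosen for brevity; it is not the method of Non-Patent Document 1, and the names `leader_clustering` and `euclidean`, the threshold, and the toy feature vectors are assumptions.

```python
def leader_clustering(items, distance, threshold):
    """Greedy clustering: each item joins the closest cluster leader within
    `threshold`, otherwise it starts a new cluster. Returns lists of indices."""
    clusters = []  # the first index in each list is that cluster's leader
    for idx, item in enumerate(items):
        best, best_d = None, threshold
        for members in clusters:
            d = distance(items[members[0]], item)
            if d <= best_d:
                best, best_d = members, d
        if best is None:
            clusters.append([idx])
        else:
            best.append(idx)
    return clusters


def euclidean(a, b):
    """Placeholder distance; any segment similarity index could be used instead."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical per-segment feature vectors (e.g. averaged acoustic features).
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9), (0.0, 0.2)]
clusters = leader_clustering(features, euclidean, threshold=1.0)
print(clusters)
```

The distance function is passed in as a parameter so that an order-independent statistic or an order-aware DP matching score can be substituted without changing the grouping logic.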

Here, the similarity index is an index that measures the similarity of the phonemes making up the segments. Examples are indices based on statistics of the acoustic features, such as the Bhattacharyya distance calculated from the mean and variance of the acoustic feature sequence, the Kullback-Leibler divergence, and the log-likelihood ratio. These indices do not take into account the order of the acoustic feature sequence within a segment.
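To make the order-independent indices concrete, the sketch below computes the Bhattacharyya distance between two segments, modeling each segment's feature frames as a Gaussian with a diagonal covariance. The diagonal-covariance simplification, the variance floor, and the function names are assumptions made for illustration; the patent only states that the distance is calculated from the mean and variance.

```python
import math


def mean_and_variance(frames, floor=1e-6):
    """Per-dimension mean and variance of a list of feature frames."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    # The floor avoids division by zero for constant dimensions (an assumption).
    variances = [
        max(sum((f[d] - means[d]) ** 2 for f in frames) / n, floor)
        for d in range(dim)
    ]
    return means, variances


def bhattacharyya_distance(frames_a, frames_b):
    """Bhattacharyya distance between diagonal Gaussians fitted to two segments."""
    mean_a, var_a = mean_and_variance(frames_a)
    mean_b, var_b = mean_and_variance(frames_b)
    dist = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        v = 0.5 * (va + vb)  # averaged variance of the two segments
        dist += 0.125 * (ma - mb) ** 2 / v + 0.5 * math.log(v / math.sqrt(va * vb))
    return dist


# Toy 2-dimensional "feature frames" (hypothetical values).
seg_a = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
seg_a_shuffled = [[4.0, 5.0], [0.0, 1.0], [2.0, 3.0]]
seg_b = [[10.0, 1.0], [12.0, 3.0], [14.0, 5.0]]
print(bhattacharyya_distance(seg_a, seg_a_shuffled))
print(bhattacharyya_distance(seg_a, seg_b))
```

Because only the mean and variance enter the formula, shuffling the frames of a segment leaves the distance unchanged, which is the order-independence noted in the text.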

An index that takes the order, that is, the time order, into account may also be used. One such method is DP matching, which finds the optimal correspondence between the acoustic features of two segments by dynamic programming (hereinafter "DP") and calculates the similarity from it. Here, the acoustic features are, for example, MFCCs, which are widely used in voice recognition and other applications.
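DP matching is commonly realized as dynamic time warping. The sketch below is one such formulation, assuming a Euclidean frame distance, the three standard step directions, and normalization by the combined sequence length; these details and the name `dp_matching_distance` are illustrative assumptions, since the patent does not specify them.

```python
def dp_matching_distance(seq_a, seq_b):
    """Time-order-aware distance between two feature sequences via dynamic programming."""

    def frame_dist(x, y):
        return sum((p - q) ** 2 for p, q in zip(x, y)) ** 0.5

    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Normalize so that long and short segment pairs are comparable.
    return cost[n][m] / (n + m)


# Toy 1-dimensional feature sequences (hypothetical values).
a = [[0.0], [1.0], [2.0]]
a_stretched = [[0.0], [0.0], [1.0], [2.0]]  # same content, spoken more slowly
a_reversed = [[2.0], [1.0], [0.0]]          # same frames, different order
print(dp_matching_distance(a, a_stretched))
print(dp_matching_distance(a, a_reversed))
```

Unlike a statistic-based index, this distance stays at zero for a time-stretched copy but becomes positive when the same frames appear in a different order.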

When the clustering in the clustering unit 12 has converged (Yes in step S103), the selection unit 13 selects, from the clusters generated by the clustering unit 12, a cluster that satisfies a predetermined condition based on the size of each cluster (step S104). In this selection, the selection unit 13 compares cluster sizes with a view to finding frequently occurring phrases and picks at least one cluster in descending order of size. Examples of the predetermined condition are containing more segments or having a longer total segment time length. That is, the selection unit 13 selects, for example, a cluster containing more segments, or a cluster whose segments have a longer total time length, as the cluster satisfying the predetermined condition.
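The two size criteria named above (segment count and total time length) can rank clusters differently. The sketch below shows both, followed by the extraction of the selected cluster's segments; the dictionary layout, the cluster labels, and the function name `select_largest_clusters` are hypothetical.

```python
def select_largest_clusters(clusters, top_k=1, size="count"):
    """Return the ids of the top_k clusters, ranked by the chosen size measure.

    clusters maps a cluster id to a list of (start_sec, end_sec) segments.
    """

    def cluster_size(segments):
        if size == "count":
            return len(segments)
        if size == "duration":
            return sum(end - start for start, end in segments)
        raise ValueError("size must be 'count' or 'duration'")

    ranked = sorted(clusters.items(), key=lambda kv: cluster_size(kv[1]), reverse=True)
    return [cluster_id for cluster_id, _ in ranked[:top_k]]


# Hypothetical clustering result: segment times in seconds.
clusters = {
    "A": [(0.0, 2.0), (10.0, 12.0), (30.0, 32.0)],  # 3 segments, 6 s in total
    "B": [(5.0, 9.0), (20.0, 24.0)],                # 2 segments, 8 s in total
    "C": [(15.0, 16.0)],                            # 1 segment, 1 s in total
}
by_count = select_largest_clusters(clusters, size="count")
by_duration = select_largest_clusters(clusters, size="duration")
# Extraction step: collect the segments of the selected cluster.
extracted = clusters[by_count[0]]
print(by_count, by_duration, extracted)
```

Which measure is appropriate depends on whether many short repetitions or fewer long repetitions should count as "frequent"; the text leaves this choice open.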

Here, clustering is regarded as having converged when, for example, steps S101 and S102 have been executed a predetermined number of times, or when the change in a predetermined evaluation value for the clustering has fallen to or below a certain value. Convergence may also refer to the situation in which, along with the change in the evaluation value falling to or below a certain value, segments no longer move between clusters.

The extraction unit 14 extracts segments from the one or more segments included in the cluster selected by the selection unit 13 (step S105). In this way, the extraction unit 14 can extract from the voice data the segments corresponding to the desired phrase.

If the clustering in the clustering unit 12 has not converged (No in step S103), processing returns to step S101. This reflects the fact that steps S101 and S102 depend on each other, so they may be repeated a predetermined number of times or until convergence.

The processing of the generation unit 11 and the clustering unit 12 can also be executed jointly using a hidden Markov model (hereinafter "HMM") with the structure shown in FIG. 4. FIG. 4 is a diagram showing an example of a method by which the voice processing device 10 extracts phrases using an HMM. Specifically, the voice processing device 10 trains an HMM such as that shown in FIG. 4 on the voice data stored in the external storage device, based on maximum likelihood estimation or a similar method. As a result, left-to-right HMMs representing a first phrase (phrase 1 in FIG. 4), a second phrase (phrase 2 in FIG. 4), ..., and an Nth phrase (phrase N in FIG. 4) are formed automatically, and the segments belonging to each phrase are obtained at the same time.

By listening to the extracted portions of the voice data, the frequently occurring phrases can be confirmed and then used for voice appraisal.

[Description of effects]
As described above, in the voice processing device 10 according to this embodiment, the generation unit 11 generates a plurality of segments from the voice data so that adjacent segments at least partially overlap, and the clustering unit 12 classifies the segments based on phonological similarity to generate clusters. The selection unit 13 then selects at least one cluster from the clusters based on the size of each cluster. Furthermore, because the extraction unit 14 extracts the segments included in the selected cluster, the segments corresponding to a desired phrase can be extracted from the voice data. The reason is that, because the generation unit 11 generates the segments so that adjacent segments at least partially overlap, anything from a unit shorter than a word to a phrase longer than a word can be generated as a single segment.

In addition, with the voice processing device 10 of this embodiment, even a person who is not a skilled appraiser can find and select the frequently occurring phrases needed for voice appraisal at low cost. The reason is that the voice processing device 10 generates segments from the given voice data so that adjacent segments at least partially overlap, clusters these segments, selects a cluster containing many similar segments, and extracts the segments included in the selected cluster. The extracted segments are segments generated by the generation unit 11, that is, partial voice data containing the desired phrase. The voice processing device 10 can thus automatically find phrases that occur frequently in the voice data.

Furthermore, because the voice processing device 10 of this embodiment finds high-frequency phrases quantitatively, frequently occurring phrases useful for voice appraisal can be found with high reliability.

[Second Embodiment]
A second embodiment of the present invention is described in detail below with reference to the drawings.

[Description of configuration]
FIG. 5 is a block diagram showing a configuration example of the voice processing device 20 according to the second embodiment of the present invention. Referring to FIG. 5, the voice processing device 20 includes a normalization learning unit 15, a voice data normalization unit 16, a voice data processing unit 17, first to Nth voice data storage units (101-1 to 101-N, where N is a positive integer), an unspecified acoustic model storage unit 102, and first to Nth parameter storage units (103-1 to 103-N, where N is a positive integer).
The normalization learning unit 15 is also referred to as the third generation unit. In this embodiment, the first to Nth voice data storage units (101-1 to 101-N) are called the voice data storage unit 101 when they are not distinguished from one another or are referred to collectively. Likewise, the first to Nth parameter storage units (103-1 to 103-N) are called the parameter storage unit 103 when they are not distinguished from one another or are referred to collectively.

The voice data storage unit 101 stores voice data of differing nature. That is, the first voice data storage unit 101-1, the second voice data storage unit 101-2, ..., and the Nth voice data storage unit 101-N each store voice data of a different nature. The voice data stored in each of these units has been classified according to its acoustic properties.

The unspecified acoustic model storage unit 102 stores the unspecified acoustic model learned by the normalization learning unit 15. The unspecified acoustic model is a model obtained by normalizing the differences among the voice data having different properties stored in the voice data storage unit 101.

The parameter storage unit 103 stores the parameters for normalizing the differences among the voice data. That is, the first parameter storage unit 103-1, the second parameter storage unit 103-2, ..., and the Nth parameter storage unit 103-N each store parameters for normalizing the differences among the voice data.

The normalization learning unit 15 performs normalization learning using the voice data having different properties stored in the voice data storage unit 101.

Here, normalization learning is, for example, the acoustic model learning method described in Non-Patent Document 2.
An acoustic model defines each phoneme i by the mean vector μ_i of the acoustic features, but normalization learning assumes that the mean vector can vary with the properties of the voice data. That is, in the present embodiment, the mean vector (unspecified acoustic model) μ_i is expressed through an affine transformation, as in equation (1).

Here, s = 1, 2, ..., N. A_s and b_s are parameters for normalizing the differences in the properties of the voice data.
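As a sketch, and assuming the conventional affine form used in adaptive training of acoustic models, equation (1) can be read as follows. The symbol \mu_i^{(s)}, standing for the mean vector of phoneme i as observed in the s-th voice data, is notation introduced here for illustration only.

```latex
\mu_i^{(s)} = A_s \, \mu_i + b_s, \qquad s = 1, 2, \ldots, N \tag{1}
```

Under this reading, A_s is a square matrix and b_s is an offset vector, both with the dimensionality of the acoustic features.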

Equation (1) yields the unspecified acoustic model μ_i, which is not affected by differences in the properties of the voice data, and the parameters A_s and b_s for normalizing those differences. The normalization learning unit 15 stores the unspecified acoustic model μ_i in the unspecified acoustic model storage unit 102 and stores the parameters A_s and b_s in the parameter storage unit 103. Specifically, the normalization learning unit 15 stores the parameters A_1 and b_1 in the first parameter storage unit 103-1, and the parameters A_N and b_N in the Nth parameter storage unit 103-N. Non-Patent Document 2 describes a method of normalizing speaker differences on the assumption that the properties of voice data differ from speaker to speaker; however, differences in the properties of voice data are not limited to speakers and may also stem from background noise, microphones, communication lines, and so on.

That is, the normalization learning unit 15 generates normalization parameters for normalizing the differences among the voice data having different properties stored in the voice data storage unit 101, and stores them in the parameter storage unit 103. The normalization learning unit 15 also learns the unspecified acoustic model by normalizing those differences, and stores the learned model in the unspecified acoustic model storage unit 102. Here, the normalization learning unit 15 generates the normalization parameters by estimating them. When performing an iterative calculation, for example, the normalization learning unit 15 stores the unspecified acoustic model in the unspecified acoustic model storage unit 102 at each iteration.

The voice data normalization unit 16 refers to the parameters stored in the parameter storage unit 103, normalizes the voice data stored in each voice data storage unit 101, and sends the result to the voice data processing unit 17. Specifically, for the time series x_1, x_2, ..., x_t, ... (t is a positive integer) of acoustic features of the s-th voice data, it applies equation (2), a transformation corresponding to the inverse of equation (1), using the s-th parameters.
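The normalization step can be illustrated with a minimal sketch. It assumes that equation (2) has the form x'_t = A_s^{-1}(x_t - b_s), that is, the inverse of an affine map, and it further assumes a diagonal A_s so that the inverse reduces to element-wise division. The function name and the toy values are hypothetical.

```python
def normalize_features(frames, a_diag, b):
    """Apply x' = A^{-1} (x - b) to every frame, assuming A = diag(a_diag)."""
    if any(a == 0 for a in a_diag):
        raise ValueError("A_s must be invertible (no zero on the diagonal)")
    return [
        [(x - bias) / scale for x, scale, bias in zip(frame, a_diag, b)]
        for frame in frames
    ]


# Toy example: two 2-dimensional feature frames from the s-th voice data.
frames_s = [[3.0, 10.0], [5.0, 14.0]]
a_s = [2.0, 4.0]  # hypothetical diagonal of A_s
b_s = [1.0, 2.0]  # hypothetical offset b_s
normalized = normalize_features(frames_s, a_s, b_s)
```

A full matrix A_s would require a general linear solve instead of the element-wise division used here.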

The parameters that define the normalization may differ according to phoneme class (fricative, plosive, and so on), or may differ according to the preceding and following phonemes to account for context dependence. The voice data normalization unit 16 may normalize not only the mean vector of the acoustic features but also the variance. Beyond these, any of the various techniques known for normalization learning may be applied.

The voice data processing unit 17 has the same configuration and effects as the voice processing device 10 in the first embodiment. That is, the voice data processing unit 17 executes the processing of the generation unit 11, the clustering unit 12, the selection unit 13, and the extraction unit 14 shown in FIG. 1 in the same manner as in the first embodiment, and outputs segments containing phrases that appear frequently in the normalized voice data.

[Explanation of operation]
The operation of this embodiment will be described with reference to FIG. 6. FIG. 6 is a flowchart showing an operation example of the voice processing device 20 according to the second embodiment of the present invention. As shown in FIG. 6, the operation of the voice data processing unit 17 in the present embodiment, that is, steps S204 to S208, is the same as the operation of the voice processing device 10 in the first embodiment, that is, steps S101 to S105, and its description is therefore omitted.

The normalization learning unit 15 reads the voice data from each voice data storage unit 101, performs normalization learning, and stores the normalization parameters for each set of voice data in the parameter storage unit 103 (step S201).

The normalization learning unit 15 stores, in the unspecified acoustic model storage unit 102, the unspecified acoustic model generated after normalization has removed the differences in the properties of the voice data (step S202).

The voice data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103 and normalizes the voice data stored in each voice data storage unit 101 (step S203).

The voice data processing unit 17 executes the same processing as steps S101 to S105 of the voice processing device 10 in the first embodiment shown in FIG. 3, and outputs segments containing phrases that appear frequently in the voice data (steps S204 to S208).

[Explanation of effect]
As described above, in the voice processing device 20 of the present embodiment, the normalization learning unit 15 reads the voice data from each voice data storage unit 101, performs normalization learning, and stores the normalization parameters for each set of voice data in the parameter storage unit 103. The normalization learning unit 15 also stores, in the unspecified acoustic model storage unit 102, the unspecified acoustic model generated after normalization has removed the differences in acoustic properties among the voice data. The voice data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103 and normalizes the voice data stored in each voice data storage unit 101. The voice data processing unit 17 outputs segments containing phrases that appear frequently in the normalized voice data. The voice processing device 20 can therefore normalize voice data that has not been normalized and select a desired phrase.

Further, in the voice processing device 20 of the present embodiment, the normalization learning unit 15 learns to normalize the differences in acoustic properties (for example, differences between speakers) among the first voice data, the second voice data, ..., and the Nth voice data. After the voice data normalization unit 16 removes these differences, the voice data processing unit 17 extracts segments containing phrases that appear frequently in the voice data. The voice processing device 20 can therefore extract frequently appearing phrases more accurately. The reason is that it reduces the chance that the clustering unit 12 in the voice data processing unit 17 is influenced by differences in the properties of the voice data and generates inappropriate clusters (for example, clusters of speakers).

[Third Embodiment]
Hereinafter, the third embodiment of the present invention will be described in detail with reference to the drawings.

[Description of configuration]
FIG. 7 is a block diagram showing a configuration example of the voice processing device 30 according to the third embodiment of the present invention. Referring to FIG. 7, the voice processing device 30 according to the third embodiment includes, in addition to the configuration of the voice processing device 20 according to the second embodiment, an unclassified voice data storage unit 104 and a voice data classification unit 18. Since the configuration of the voice processing device 20 in the second embodiment has already been described, its description is omitted here. The voice data classification unit 18 is also referred to as a fourth generation unit.

The unclassified voice data storage unit 104 stores voice data.

The voice data classification unit 18 classifies the voice data stored in the unclassified voice data storage unit 104 based on acoustic properties and stores it in the voice data storage unit 101. For example, the voice data classification unit 18 classifies the voice data stored in the unclassified voice data storage unit 104 into N clusters based on differences in acoustic properties, such as differences between speakers, and stores each cluster in the voice data storage unit 101. That is, the voice data classification unit 18 generates N clusters by classifying the voice data stored in the unclassified voice data storage unit 104 based on acoustic properties. It then stores the first cluster in the first voice data storage unit, the second cluster in the second voice data storage unit, ..., and the Nth cluster in the Nth voice data storage unit.

Here, the voice data stored in the unclassified voice data storage unit 104 may be a mixture of voice data having various acoustic properties. N may be a predetermined constant, or may be determined automatically by the voice data classification unit 18 according to the data to be processed. Both can be implemented by applying a known clustering method.
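One possible realization of the classification into N clusters is k-means over one feature vector per utterance. The sketch below is an assumption for illustration: the text only states that a known clustering method may be applied, and the function name, the per-utterance feature vectors, and the initialization from the first N items are all hypothetical.

```python
def classify_utterances(vectors, n_clusters, max_iters=20):
    """Assign each feature vector to one of n_clusters groups with plain k-means."""
    centroids = [list(v) for v in vectors[:n_clusters]]  # naive initialization
    labels = None
    for _ in range(max_iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(n_clusters),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(v, centroids[k])))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for k in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Four toy utterance vectors that fall into two obvious groups.
utterances = [[0.0, 0.0], [10.0, 10.0], [0.5, 0.2], [9.5, 10.1]]
labels = classify_utterances(utterances, n_clusters=2)
```

Each label k then indicates which of the storage units 101-1 to 101-N would receive the utterance. Choosing N automatically would need an additional model selection step that this sketch does not cover.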

The voice data storage unit 101 stores the voice data classified by the voice data classification unit 18.

[Explanation of operation]
The operation of this embodiment will be described with reference to FIG. 8. FIG. 8 is a flowchart showing an operation example of the voice processing device 30 according to the third embodiment of the present invention. As shown in FIG. 8, the operation of the voice data processing unit 17 in the present embodiment, that is, steps S306 to S310, is the same as the operation of the voice processing device 10 in the first embodiment, that is, steps S101 to S105, and its description is therefore omitted.

The voice data classification unit 18 classifies the voice data stored in the unclassified voice data storage unit 104 based on acoustic properties and stores it in the voice data storage unit 101 (step S301).

The normalization learning unit 15 reads the voice data from each voice data storage unit 101, performs normalization learning, and stores the normalization parameters for each set of voice data in the parameter storage unit 103 (step S302).

The normalization learning unit 15 stores, in the unspecified acoustic model storage unit 102, the unspecified acoustic model generated after normalization has removed the differences in the properties of the voice data (step S303).

When the results of the voice data classification unit 18 and the normalization learning unit 15 have converged (Yes in step S304), the voice data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103 and normalizes the voice data stored in each voice data storage unit 101 (step S305).

When the results of the voice data classification unit 18 and the normalization learning unit 15 have not converged (No in step S304), the process returns to step S301. In this way, the voice data classification unit 18 and the normalization learning unit 15 can be executed alternately and repeatedly until the results converge.

Note that the results output by the voice data classification unit 18 and the normalization learning unit 15 may depend on each other. The two units may therefore be executed alternately in an iterative manner until the number of executions reaches a predetermined threshold or until the results converge. Such an operation can be carried out efficiently based on an optimization criterion such as likelihood maximization, following the method described in Non-Patent Document 3.
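The alternation between classification (step S301) and normalization learning (steps S302 and S303) can be sketched as a loop with two stopping conditions. The callables `classify` and `learn`, the convergence test on unchanged labels, and the iteration cap are assumptions made for illustration; the text does not fix a particular convergence criterion.

```python
def alternate_until_converged(data, classify, learn, max_iters=10):
    """Repeat classify -> learn until labels stop changing or max_iters is hit."""
    params = None
    labels = None
    for iteration in range(1, max_iters + 1):
        new_labels = classify(data, params)  # corresponds to step S301
        params = learn(data, new_labels)     # corresponds to steps S302-S303
        if new_labels == labels:             # step S304: results converged
            return new_labels, params, iteration
        labels = new_labels
    return labels, params, max_iters


# Hypothetical 1-D stand-ins for the classification and learning units.
def toy_classify(data, means):
    if means is None:
        return [0 if x < 5 else 1 for x in data]
    return [min(means, key=lambda k: abs(x - means[k])) for x in data]


def toy_learn(data, labels):
    return {k: sum(x for x, lab in zip(data, labels) if lab == k) / labels.count(k)
            for k in set(labels)}


labels, means, n_iters = alternate_until_converged(
    [1.0, 2.0, 8.0, 9.0], toy_classify, toy_learn)
```

In this toy run the second pass reproduces the labels of the first, so the loop stops after two iterations.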

The voice data processing unit 17 executes the same processing as steps S101 to S105 of the voice processing device 10 in the first embodiment shown in FIG. 3, and outputs segments containing phrases that appear frequently in the voice data (steps S306 to S310).

[Explanation of effect]
As described above, in the voice processing device 30 of the present embodiment, the voice data classification unit 18 classifies the voice data stored in the unclassified voice data storage unit 104 based on acoustic properties and stores it in the voice data storage unit 101. The normalization learning unit 15 then reads the voice data from each voice data storage unit 101, performs normalization learning, and stores the normalization parameters for each set of voice data in the parameter storage unit 103. The normalization learning unit 15 also stores, in the unspecified acoustic model storage unit 102, the unspecified acoustic model generated after normalization has removed the differences in acoustic properties among the voice data. The voice data normalization unit 16 refers to the normalization parameters stored in the parameter storage unit 103 and normalizes the voice data stored in each voice data storage unit 101. The voice data processing unit 17 outputs segments containing phrases that appear frequently in the normalized voice data. The voice processing device 30 can therefore classify and normalize voice data that has been neither classified nor normalized, and select a desired phrase.

Further, in the voice processing device 30 of the present embodiment, the voice data classification unit 18 classifies the voice data into N clusters based on differences in acoustic properties, and the normalization learning unit 15 performs normalization learning using the result. Compared with the voice processing device 20 of the second embodiment, the voice processing device 30 can therefore reduce the cost of preparing voice data. The reason is that the voice processing device 30 does not require the voice data to be separated in advance according to differences in acoustic properties (for example, by speaker); a miscellaneous collection of voice data can be supplied and processed as a whole.

[Fourth Embodiment]
[Description of configuration]
Hereinafter, the fourth embodiment of the present invention will be described in detail with reference to the drawings.

FIG. 9 is a block diagram showing a configuration example of the voice processing system 40 according to the fourth embodiment of the present invention. Referring to FIG. 9, the voice processing system 40 according to the fourth embodiment includes a voice processing device 41, a voice input device 42, an instruction input device 43, and an output device 44.

The voice processing device 41 executes, on the input voice, the processing of the voice processing device 10 in the first embodiment of the present invention, the processing of the voice processing device 20 in the second embodiment, or the processing of the voice processing device 30 in the third embodiment (hereinafter referred to as "the phrase extraction process according to the first to third embodiments of the present invention").

The voice input device 42 inputs voice. The voice input device 42 is any device that serves as an interface for inputting arbitrary voice data to the voice processing device 41, such as a microphone that captures a voice signal as data or a memory that records voice data. The voice input device 42 is, for example, the input device 1005 shown in FIG. 2.

The output device 44 outputs the result of the processing executed by the voice processing device 41. The output device 44 is an output device such as a monitor or a speaker that outputs the processing result of the voice processing device 41 by visual or auditory means according to the instruction entered by the operator through the instruction input device 43. When the output device 44 is a monitor, it may, for example, display a list of clusters in order of size, display the contents of a specific cluster as a waveform diagram or spectrogram, or display a plurality of segments side by side for comparison. When the output device 44 is a speaker, it may, for example, play back the voice. The output device 44 is realized by, for example, the display device 1006.

The instruction input device 43 receives instruction information from the operator and controls the information displayed on the display device. The instruction input device 43 is a user interface that receives the operator's instructions, such as operations on the information output by the output device 44 and execution of processing by the voice processing device 41; any input device such as a mouse, keyboard, or touch panel can be used.

[Explanation of operation]
Hereinafter, an operation example of the voice processing system 40 according to the fourth embodiment of the present invention will be described.

The instruction input device 43 receives instruction information from the operator and controls the voice processing device 41 so that it executes processing. The voice input device 42 inputs arbitrary voice data to the voice processing device 41. Based on the input voice data, the voice processing device 41 executes the phrase extraction process according to the first to third embodiments of the present invention, selects a cluster containing a frequently appearing phrase, and extracts the segments contained in the selected cluster. The output device 44 outputs the processing result of the voice processing device 41 by visual or auditory means according to the instruction entered by the operator through the instruction input device 43. In other words, the output device 44 outputs the processing result in the form the operator wishes to view.

[Explanation of effect]
As described above, in the voice processing system 40 of the present embodiment, the instruction input device 43 controls the voice processing device 41 so that it executes processing according to the instruction information entered by the operator. The voice input device 42 inputs arbitrary voice data to the voice processing device 41. Based on the input voice data, the voice processing device 41 executes the phrase extraction process according to the first to third embodiments of the present invention, selects a cluster containing a frequently appearing phrase (segment), and extracts the segments contained in the selected cluster. The output device 44 outputs the processing result of the voice processing device 41 by visual or auditory means according to the instruction entered by the operator through the instruction input device 43. The voice processing system 40 can therefore output clusters and segments that contain phrases appearing frequently in the voice data.

Further, the voice processing system 40 of the present embodiment makes it easy for the operator to perform analysis work such as identifying a person from voice. The reason is that the system is configured so that the processing result is output to the output device 44 in the form the operator wishes to view. In addition, because frequently appearing phrases are output visually and audibly, the voice processing system 40 can be used to analyze, for example, a particular person's habitual expressions and tendencies in topics.

(Concrete example)
Hereinafter, a specific example of the first embodiment of the present invention will be described. An example in which the voice processing device 10 extracts a phrase from voice data will be described with reference to FIGS. 10 to 12.

The details of an example of extracting a phrase from the voice data stored in the external storage device are described below with reference to FIGS. 10 to 12. FIG. 10 is a diagram showing an example of the voice data stored in the external storage device. Here, the external storage device is realized by, for example, the voice input device 42 in the fourth embodiment.

As shown in FIG. 10, the external storage device stores voice data together with a voice data ID, which is an identifier of that voice data. For voice data ID "1", the external storage device stores the voice data "... I have your child. Prepare the ransom. The meeting place is ...". The voice data stored in the external storage device is not limited to the content shown in FIG. 10.

FIG. 11 is a diagram showing an example of how the generation unit 11 generates segments from voice data. As shown in FIG. 11, the generation unit 11 generates a plurality of segments from the voice data shown in FIG. 10, that is, the voice data with voice data ID "1": "... I have your child. Prepare the ransom. The meeting place is ...". In FIG. 11, segment 1 is "預かった" (the verb of "I have your child") and segment 2 is "預かった。身代" (segment 1 followed by the first part of the word for "ransom"). The generation unit 11 subdivides the voice data at arbitrary points (for example, at predetermined time intervals) and uses the pieces to generate the segments. Here, the generation unit 11 generates the segments so that they overlap one another: segments 1 and 2 share the portion "預かった". This allows the voice processing device 10 to extract the desired phrase from the voice data.
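Overlapping segment generation can be sketched as a sliding window over frame indices. The window lengths and the hop size below are hypothetical parameters, since the text only says that the voice data is subdivided arbitrarily (for example, by a predetermined time).

```python
def generate_segments(n_frames, windows, hop):
    """Return (start, end) frame ranges for every start position and window length.

    Ranges that would run past the end of the data are skipped. Segments
    overlap whenever hop is smaller than a window or several windows share a start.
    """
    if hop <= 0:
        raise ValueError("hop must be positive")
    segments = []
    for start in range(0, n_frames, hop):
        for window in windows:
            if start + window <= n_frames:
                segments.append((start, start + window))
    return segments


segments = generate_segments(n_frames=8, windows=(4, 6), hop=2)
```

The first two results, (0, 4) and (0, 6), play the role of segments 1 and 2 in FIG. 11: the longer one contains the shorter one.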

FIG. 12 is a diagram showing an example of how the clustering unit 12 generates clusters that group a plurality of segments. As shown in FIG. 12, a cluster includes, for example, a cluster ID that identifies the content (phrase) of its segments, the content of the segments, and the number of times that content (phrase) appears across all the voice data. As shown in FIG. 12, in this specific example the cluster IDs are assumed to be the same as the segment numbers shown in FIG. 11. For example, the cluster with cluster ID "1" indicates that the phrase "預かった" appeared 20 times in all the voice data. That is, the clustering unit 12 calculates the similarity between the segments generated by the generation unit 11 as shown in FIG. 11, and generates clusters that group together segments with high similarity, that is, segments with the same content.
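Grouping by similarity can be illustrated with a greedy sketch. As a loudly stated assumption, string similarity from Python's standard `difflib` stands in here for the acoustic similarity between segments, and the threshold value and function name are hypothetical.

```python
from difflib import SequenceMatcher


def cluster_segments(segments, threshold=0.8):
    """Greedy clustering: a segment joins the first cluster whose
    representative (its first member) is similar enough, else starts a new one."""
    clusters = []
    for seg in segments:
        for cluster in clusters:
            if SequenceMatcher(None, seg, cluster[0]).ratio() >= threshold:
                cluster.append(seg)
                break
        else:
            clusters.append([seg])
    return clusters


# Toy symbol strings standing in for segments.
clusters = cluster_segments(["abcd", "abcd", "wxyz", "abce"], threshold=0.7)
sizes = [len(c) for c in clusters]
```

The number of members in each cluster corresponds to the occurrence count shown in FIG. 12.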

The selection unit 13 compares the clusters using the number of segments included in each cluster and the total time length, and selects a cluster that satisfies a predetermined condition. For example, the selection unit 13 first compares the clusters generated by the clustering unit 12 based on the number of segments included in each cluster, that is, the number of occurrences of the phrase. As shown in FIG. 12, the selection unit 13 selects "the ransom" (「身代金を」), which appears 35 times, and "Prepare the ransom." (「身代金を用意しろ。」), which appears 30 times. Next, the selection unit 13 compares the clusters based on their size. For example, the selection unit 13 defines the size of each cluster as the product of the number of occurrences and the segment length, that is, the time length, and selects the cluster with the largest size.

As shown in FIG. 12, the selection unit 13 compares, for example, the cluster with cluster ID 7 and the cluster with cluster ID 8. The selection unit 13 compares the product of 35 occurrences and the time length of "the ransom" (「身代金を」) with the product of 30 occurrences and the time length of "Prepare the ransom." (「身代金を用意しろ。」), and selects the cluster whose segment content is "Prepare the ransom." That is, "Prepare the ransom." is the desired phrase. When comparing clusters that have the same number of occurrences, the selection unit 13 may make the selection by comparing only the time lengths of the segments. The selection unit 13 is not limited to the above method; it may define the size based on various indices, such as the number of occurrences, the time length, or the number of phonemes, and compare the clusters accordingly.
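The two-stage comparison described above can be sketched as follows. The occurrence counts (20, 35, and 30) come from the example; the time lengths are hypothetical values chosen only for illustration, because the description does not state them. The function name `select_cluster` and the parameter `top_n` are likewise assumptions.

```python
from typing import List, Tuple

# (phrase, number of occurrences, segment time length in seconds)
Cluster = Tuple[str, int, float]


def select_cluster(clusters: List[Cluster], top_n: int = 2) -> Cluster:
    """Pick the largest cluster among the top_n most frequent ones.

    Size is defined as occurrences x time length, as in the example above.
    """
    if not clusters:
        raise ValueError("no clusters to select from")
    # Stage 1: keep the clusters with the most occurrences.
    candidates = sorted(clusters, key=lambda c: c[1], reverse=True)[:top_n]
    # Stage 2: among those, choose the cluster with the largest size.
    return max(candidates, key=lambda c: c[1] * c[2])


clusters = [
    ("have taken", 20, 0.7),            # time length is hypothetical
    ("the ransom", 35, 0.8),            # time length is hypothetical
    ("Prepare the ransom.", 30, 1.6),   # time length is hypothetical
]
print(select_cluster(clusters)[0])
```

When two clusters have the same number of occurrences, comparing the products reduces to comparing the time lengths alone, which matches the simplification mentioned above.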

Then, the extraction unit 14 extracts the segments from the selected cluster. As a result, the voice data of the segments whose content is "Prepare the ransom" is extracted. The voice data of these segments shows that the phrase "Prepare the ransom" appears frequently in the voice data.

As described above, the voice processing device 10 in this specific example can extract the frequent phrase "Prepare the ransom" from voice data such as "... I have taken your child. Prepare the ransom. The meeting place is ...".

The present invention has been described above with reference to embodiments and a specific example, but the present invention is not necessarily limited to them. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope, that is, within the scope of its technical idea.

This application claims priority based on Japanese Patent Application No. 2015-061854, filed on March 25, 2015, the entire disclosure of which is incorporated herein.

10 Voice processing device
11 Generation unit
12 Clustering unit
13 Selection unit
14 Extraction unit
15 Normalization learning unit
16 Voice data normalization unit
17 Voice data processing unit
18 Voice data classification unit
20 Voice processing device
30 Voice processing device
40 Voice processing system
41 Voice processing device
42 Voice input device
43 Instruction input device
44 Output device
101 Voice data storage unit
102 Unspecified acoustic model storage unit
103 Parameter storage unit
1000 Computer
1001 CPU
1002 Main storage device
1003 Auxiliary storage device
1004 Interface
1005 Input device
1006 Display device

Claims

A voice processing device comprising:
a first generation means for generating, from voice data, a plurality of segments in which adjacent segments at least partially overlap;
a second generation means for classifying the plurality of segments based on phonological similarity to generate clusters related to the segments;
a selection means for selecting, based on the size of the clusters related to the segments, a cluster related to the segments that satisfies a predetermined condition; and
an extraction means for extracting the segments included in the selected cluster related to the segments,
the voice processing device further comprising:
a third generation means for generating, based on a plurality of pieces of voice data, a plurality of normalization parameters for normalizing the difference in acoustic properties of the plurality of pieces of voice data; and
a normalization means for normalizing the voice data by using the plurality of normalization parameters,
wherein the first generation means generates the plurality of segments from the normalized voice data,
the voice processing device further comprising a fourth generation means for classifying the voice data based on the difference in acoustic properties to generate a cluster related to the voice data,
wherein the third generation means generates a normalization parameter for the cluster related to the voice data.

The voice processing device according to claim 1, wherein the selection means compares and selects the clusters related to the segments by using the number of segments included in the clusters related to the segments or the total time length of those segments.

The voice processing device according to claim 1 or 2, wherein the second generation means calculates the similarity between the segments by comparing the acoustic features constituting the segments.

The voice processing device according to claim 1, wherein the second generation means calculates the similarity between the segments by DP (Dynamic Programming) matching.

The voice processing device according to any one of claims 1 to 4, wherein the fourth generation means and the third generation means are executed alternately and repeatedly, each based on the result of the other, until the results converge or the number of executions reaches a predetermined threshold value.

A voice processing method comprising:
generating, from voice data, a plurality of segments in which adjacent segments at least partially overlap;
classifying the plurality of segments based on phonological similarity to generate clusters related to the segments;
selecting, based on the size of the clusters related to the segments, a cluster related to the segments that satisfies a predetermined condition; and
extracting the segments included in the selected cluster related to the segments,
wherein generating the plurality of segments includes generating, based on a plurality of pieces of voice data, a plurality of normalization parameters for normalizing the difference in acoustic properties of the plurality of pieces of voice data, normalizing the voice data by using the plurality of normalization parameters, and generating the plurality of segments from the normalized voice data, and
wherein generating the plurality of normalization parameters includes classifying the voice data based on the difference in acoustic properties to generate a cluster related to the voice data, and generating a normalization parameter for the cluster related to the voice data.

A program that causes a computer to execute:
a process of generating, from voice data, a plurality of segments in which adjacent segments at least partially overlap;
a process of classifying the plurality of segments based on phonological similarity to generate clusters related to the segments;
a process of selecting, based on the size of the clusters related to the segments, one or more clusters related to the segments that satisfy a predetermined condition; and
a process of extracting the segments included in the selected cluster related to the segments,
wherein the process of generating the plurality of segments includes a process of generating, based on a plurality of pieces of voice data, a plurality of normalization parameters for normalizing the difference in acoustic properties of the plurality of pieces of voice data, a process of normalizing the voice data by using the plurality of normalization parameters, and a process of generating the plurality of segments from the normalized voice data, and
wherein the process of generating the plurality of normalization parameters classifies the voice data based on the difference in acoustic properties to generate a cluster related to the voice data, and generates a normalization parameter for the cluster related to the voice data.

A voice processing system comprising:
an instruction input device that receives instruction information from an operator;
a voice input device that inputs voice data to a voice processing device;
the voice processing device according to any one of claims 1 to 5, which executes processing on the input voice data based on the instruction information; and
an output device that outputs the processing result of the voice processing device,
wherein the output device outputs the processing result in accordance with the instruction information.