JP5669707B2

JP5669707B2 - Similar document search device

Info

Publication number: JP5669707B2
Application number: JP2011217149A
Authority: JP
Inventors: ▲シン▼ 徐; 加藤　恒夫; 恒夫加藤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2011-09-30
Filing date: 2011-09-30
Publication date: 2015-02-12
Anticipated expiration: 2031-09-30
Also published as: JP2013077193A

Description

本発明は、コンテンツ関連用語によるコンテンツの検索システムや、大量の文書データ（テキスト文書）を検索するテキストの検索システムにおいて使用される検索装置に関し、クエリと称する検索用の特定の文字列（以降、「検索文字列」と言う）が検索対象とする文書の原文と一致しない場合であっても、類似度を高速に判別して複数の検索対象文書から目的の文書を検索できるようにした類似文書検索装置に関する。 The present invention relates to a search device used in a content search system based on content-related terms and a text search system for searching a large amount of document data (text document), and a specific character string (hereinafter referred to as a query) called a query. Even if the search string does not match the original text of the document to be searched, a similar document that allows the target document to be searched from multiple search target documents by quickly determining the similarity The present invention relates to a search device.

例えば、非特許文献1のＦＡＳＴＡプログラムでは、N-Gram要素（連続したN個文字）を単位とする２種類の検索手順を実行（２パス検索）することで、指定したDNA配列に類似する配列の高速データベース検索を実現している。
すなわち、初めに、検索対象となる文字列（DNAデータベースに格納しているプレーンテキストフォーマットのタンパク質のアミノ酸配列）をN-Gram要素に分解し、上記N-Gram要素とそれの出現箇所を照合用のテーブルに登録する。
検索時では、第１パスで検索文字列（指定したタンパク質のアミノ酸配列）を同様にN-Gram要素に分解して照合用のテーブルを検索し、検索文字列のN-Gram要素が含まれる割合（一致度）の高い検索対象の部分（DNAデータベースのタンパク質のアミノ酸配列中の数箇所）を検索対象候補として出力する。
そして、第２パスでこれらの候補に限られた領域でスミス−ウォーターマン法に基づく動的計画法演算を行う。 For example, in the FASTA program of Non-Patent Document 1, a sequence similar to a specified DNA sequence is obtained by executing two types of search procedures (two-pass search) in units of N-Gram elements (consecutive N characters). High-speed database search is realized.
That is, first, the character string to be searched (the amino acid sequence of the protein in plain text format stored in the DNA database) is decomposed into N-Gram elements, and the above N-Gram elements and their occurrences are used for collation. Register in the table.
When searching, the search string (amino acid sequence of the specified protein) in the first pass is similarly decomposed into N-Gram elements and the matching table is searched, and the percentage of N-Gram elements in the search string is included The search target part (several places in the amino acid sequence of the protein in the DNA database) having a high (degree of matching) is output as a search target candidate.
Then, a dynamic programming calculation based on the Smith-Waterman method is performed in a region limited to these candidates in the second pass.

また、特許文献１の「音声検索システム」では、検索対象となるニュースや講演などの音声データを音声認識により予め文字列に変換しておき、入力された検索文字列で検索するシステムにおいて、検索文字列に対応する音声部分が誤認識されていて、音声認識結果の文字列と検索文字列と一致しない場合にも検索可能で、かつ高速な検索を可能にする技術が開示されている。検索手法としては、上記音声認識に使われた認識辞書の単語において、検索文字列と音響的に類似度が高い単語列候補（認識辞書の単語の並び）を効率的に求め、上記単語列候補の中に上記認識結果に出現したものを検索結果として出力する。上記単語列候補を求める際に、入力された検索文字列を音素列に変換し、連続単語音声認識で用いられているフレーム同期ビームサーチ（入力特徴ベクトルを上記検索文字列の音素列に置き換えること）やレベルビルディングなどの手法を用い、検索文字列の音素列を認識辞書中の単語の様々な組み合わせの音素列と照合することによって、類似度の高い単語組合せを候補として選び出す。 Further, in the “voice search system” of Patent Document 1, voice data such as news and lectures to be searched is converted into a character string in advance by voice recognition, and the search is performed using the input search character string. There has been disclosed a technique that enables search even when a voice portion corresponding to a character string is misrecognized and does not match the character string of the voice recognition result and the search character string, and enables high-speed search. As a search method, in the words of the recognition dictionary used for the speech recognition, a word string candidate (sequence of words in the recognition dictionary) that is acoustically similar to the search character string is efficiently obtained, and the word string candidate What appears in the recognition result is output as a search result. When obtaining the word string candidates, the input search character string is converted into a phoneme string, and a frame-synchronized beam search used in continuous word speech recognition (the input feature vector is replaced with the phoneme string of the search character string) ) And level building or the like, and by comparing the phoneme string of the search character string with the phoneme string of various combinations of words in the recognition dictionary, word combinations having high similarity are selected as candidates.

また、本発明者らは、歌詞の一部分を入力して楽曲を検索する文書データ検索装置において、ユーザが聞き間違えて覚えてしまった歌詞で検索した場合にも、類似度を高速に判別して目的歌詞の検索ができる検索装置を特開２０１０−１２３００５号（特許文献２）で提案した。
この検索装置では、まず入力される検索文字列に出現しそうな単語を予測し、検索対象である文書（歌詞）との類似度を表す距離値を予め計算しておき、INDEXに登録する。上記距離値の計算方法は、各検索対象の文書において、予測される検索文字列について構成単語を抽出し、それぞれ音素列に変換してから、上記単語の音素列と逐次に音素ベースでのＤＰマッチングを行い、最も低いＤＰマッチングコスト値を当文書と予測した単語の距離値にするというものである。
検索時には、検索文字列から単語を抽出し、各検索対象文書に対して上記INDEXを検索し、検索文字列を構成する単語（出現順番を無視）に対する距離値を合計して類似度とする。そして、検索文字列と検索対象文書との類似度のもとに、検索結果が出力される。 In addition, in the document data search apparatus for searching for music by inputting a part of lyrics, the present inventors can determine the similarity at high speed even when searching with lyrics that the user has mistakenly remembered. Japanese Patent Application Laid-Open No. 2010-123005 (Patent Document 2) proposed a search device capable of searching for target lyrics.
In this search device, first, a word that is likely to appear in an input search character string is predicted, and a distance value that represents a similarity to a document (lyric) that is a search target is calculated in advance and registered in INDEX. In the distance value calculation method, in each search target document, constituent words are extracted from predicted search character strings, converted into phoneme strings, respectively, and then phoneme-based DPs sequentially with the phoneme strings of the words. Matching is performed, and the lowest DP matching cost value is set as the distance value between the predicted word and the document.
At the time of search, words are extracted from the search character string, the above INDEX is searched for each search target document, and the distance values for the words constituting the search character string (ignoring the appearance order) are summed to obtain the similarity. A search result is output based on the similarity between the search character string and the search target document.

特開２００６−３１２７８号公報JP 200631278 A 特開２０１０−１２３００５号公報JP 2010-123055 A

プロシーディングスオブナショナルアカデミーオブサイエンシズユーエスエー８５巻，２４４４−２４４８頁(１９８８年)(Proc.Natl. Acad. Sci. U.S.A., vol.85, pp.2444−2448(1988)）Proceedings of National Academy of Sciences USA 85, 2444-2448 (1988) (Proc. Natl. Acad. Sci. U.S.A., vol. 85, pp. 2444-2448 (1988))

歌詞検索や映画の台本検索などの文書検索を行うに際し、検索文字列に聞き間違いや記憶違い等に起因する部分的誤りが含まれることがしばしば生じる。例えば、歌詞の原文に存在する文字列が「ムスウノヒカリ」であるのに対して、検索文字列について、表記が異なるが音響的には類似する「マッスグナヒカリ」と誤って入力して検索するような場合である。 When performing document search such as lyrics search or movie script search, the search character string often includes partial errors due to mistakes in listening or memory differences. For example, while the original text of the lyrics is “Musunohikari”, the search character string may be incorrectly entered as “Massgunakari” which is different in sound but acoustically similar. This is the case.

このような検索を行う場合、非特許文献１では２パス処理による検索高速化を実現しているが、第1パスでは、検索対象となる文書や文字列において、検索文字列のN-Gram要素が含まれる割合（一致度）の高い部分を検索対象候補とするため、含まれないN-Gram要素と検索対象との類似度を考慮しないという問題点がある。
すなわち、上述した検索例に対しては、検索対象となる「ムスウノヒカリ」において、検索文字列「マッスグナヒカリ」のN-Gram要素中に「ヒカリ」だけ含まれていて、N-Gram要素の一致度が低くなるため、正解の「ムスウノヒカリ」が第1パスにて検索候補から外されてしまい検索精度が低くなる。 In such a search, non-patent document 1 realizes a high-speed search by two-pass processing, but in the first pass, in the document or character string to be searched, the N-Gram element of the search character string Since a portion with a high percentage (matching degree) is included as a search target candidate, there is a problem that the similarity between the N-Gram element not included and the search target is not considered.
That is, for the search example described above, in the “Musunohikari” to be searched, only the “Hikari” is included in the N-Gram element of the search string “Mass Gunaharikari”, and the N-Gram element matches Since the degree is low, the correct answer “Musunohikari” is excluded from the search candidates in the first pass, and the search accuracy decreases.

特許文献１のシステムでは音響的な類似度を考慮しているが、検索するたびに、音声認識辞書の全ての単語において、検索文字列の類似単語列候補を求めるための類似度計算を行う必要がある。現状では、Web上の検索対象が多い（商用歌詞検索システムでは数万、数十万以上の歌詞を対象とする）場合に、類似度を計算する際の探索空間（単語組合せの種類数）が膨大になるため、オンラインでの検索が非常に遅くなるという問題点があった。 Although the acoustic similarity is taken into consideration in the system of Patent Document 1, it is necessary to perform similarity calculation for obtaining similar word string candidates of search character strings for all words in the speech recognition dictionary each time a search is performed. There is. At present, when there are many search targets on the Web (in the case of commercial lyrics search systems, which target tens of thousands or hundreds of thousands of lyrics), there is a search space (number of types of word combinations) for calculating similarity. There is a problem that online search becomes very slow due to the enormous volume.

特許文献２の検索装置では、検索文字列に出現可能な単語を予測してINDEXに登録するため、実際に入力された検索文字列の単語が未登録（OUT-OF-VOCABULRY問題）である場合には、オンラインでその検索文字列を音素列に変換し、各検索対象文書の音素列とのＤＰマッチングを行い、その結果得られたＤＰマッチングコスト値に基づいて検索を行うので、検索時間が遅くなるという問題点があった。また、仮に存在する全ての単語をINDEXに登録するとすれば、INDEXのサイズが膨大になるため、実際のシステムへの実装にはコストが高くなるという問題点があった。
また、特許文献２の検索装置では、第２パスでは、第１パスの近似距離計算によるソート情報などを利用せず、全ての検索対象候補についてＤＰマッチングによる計算を行うため、検索効率が低下するという問題点があった。 In the search device of Patent Document 2, since a word that can appear in a search character string is predicted and registered in INDEX, the word of the actually input search character string is not registered (OUT-OF-VOCABULRY problem) The search character string is converted online into a phoneme string, DP matching with the phoneme string of each search target document is performed, and the search is performed based on the DP matching cost value obtained as a result. There was the problem of being slow. Further, if all existing words are registered in the INDEX, the size of the INDEX becomes enormous, and there is a problem that the cost for implementation in an actual system becomes high.
Further, in the search device of Patent Document 2, the second pass does not use sort information by the approximate distance calculation of the first pass, and performs the calculation by DP matching for all search target candidates, so that the search efficiency is reduced. There was a problem.

本発明は、上記実情に鑑みて提案されたもので、文書検索を行うに際して、検索文字列に聞き間違いや記憶違い等に起因する部分的誤りが含まれる場合であっても、複数の検索対象文書から目的の文書の検索を高速かつ高精度に実現できる類似文書検索装置を提供することを目的としている。 The present invention has been proposed in view of the above circumstances, and when performing a document search, a plurality of search objects are obtained even if the search character string includes partial errors due to mishearing or memory errors. An object of the present invention is to provide a similar document search apparatus capable of realizing a search for a target document from a document at high speed and with high accuracy.

本発明は上記目的を達成するため、類似文書検索装置において、文書の構成要素となるサブワード連鎖（連続する一定数の音素または音節）を単位として検索文字列と検索対象文書との類似度を考慮するものである。そして、検索文字列と検索対象文書との類似度を考慮する場合、予め構築した検索用INDEXを用いて第１段階の検索（第１パス）を行って検索対象文書の候補を絞り、続いて第２段階の検索（第２パス）を行う２（種類の）パス検索アルゴリズムを導入する。 In order to achieve the above object, the present invention considers the similarity between a search character string and a search target document in units of a subword chain (a fixed number of continuous phonemes or syllables) as a component of a document in a similar document search apparatus. To do. Then, when considering the similarity between the search character string and the search target document, the search target document candidates are narrowed down by performing the first-stage search (first pass) using the pre-built search index. A 2 (type) path search algorithm for performing the second stage search (second path) is introduced.

先ず、上記検索用INDEXの構築では、未知の検索文字列に対応するため、Ｎ個の音節（又は音素）の組合せ若しくは大量の文書から所望数の音節（又は音素）から構成される音節列（音節N-Gram）又は音素列(音素N-Gram）をサブワード連鎖として集める。
音節列（音節N-Gram）又は音素列(音素N-Gram）は、その長さ（音節又は音素の数）によって種類が非常に多くなるため、音節N-Gram又は音素N-Gram同士間の距離に基づいてクラスタリングを行う。クラスタリングで得たサブワード連鎖クラスタ（音節列クラスタ又は音素列クラスタ）と検索対象文書（歌詞データ、台本など）との各類似度に基づいて予め検索用INDEXのテーブルを作成しておく。 First, in the construction of the search index, in order to deal with an unknown search character string, a combination of N syllables (or phonemes) or a syllable string composed of a desired number of syllables (or phonemes) from a large number of documents ( Syllable N-Gram) or phoneme string (phoneme N-Gram) is collected as a subword chain.
The syllable string (syllable N-Gram) or phoneme string (phoneme N-Gram) varies greatly depending on its length (syllable or number of phonemes), so between syllable N-Gram or phoneme N-Gram Clustering is performed based on the distance. A search index table is created in advance based on the similarity between the subword chain cluster (syllable string cluster or phoneme string cluster) obtained by clustering and the search target document (lyric data, script, etc.).

検索文字列の検索時には、上述の第１パスで検索文字列を複数のサブワード連鎖に変換し、検索用INDEXを用いた高速な探索によりサブワード連鎖と類似度の高い検索対象候補の絞り込みを行い、上述の第２パスで絞り込んだ検索対象候補と検索文字列との音素列における音素ベースでの端点フリーＤＰマッチングによる高精度探索を行うものである。
そして、第２パスでは、全ての検索対象候補についてＤＰマッチングによる計算を行うのではなく、第１パスで検索された上位から順に検索対象候補のグループを複数作成し、グループ毎に属する検索対象候補に対してＤＰマッチングを計算し、計算結果により以降のグレープのＤＰマッチング計算を打ち切るようにすることで漸進的なＤＰマッチングの計算を行う。 When searching for a search character string, the search character string is converted into a plurality of subword chains in the first pass, and the search target candidates having high similarity to the subword chain are narrowed down by high-speed search using the search index. A high-precision search is performed by end-point free DP matching based on phonemes in a phoneme string of search target candidates narrowed down in the second pass and a search character string.
In the second pass, instead of calculating by DP matching for all search target candidates, a plurality of search target candidate groups are created in order from the top searched in the first pass, and the search target candidates belonging to each group The DP matching is calculated for the above-mentioned, and the DP matching calculation for the subsequent grapes is discontinued based on the calculation result.

すなわち、請求項１の類似文書検索装置は、文字列又は音声認識により変換された文字列を検索文字列として入力する入力インタフェースを備え、複数の検索対象文書から前記検索文字列と類似する部分を含む検索対象文書を検索する検索装置において、次の構成を含むことを特徴としている。
検索用INDEX作成手段。検索用INDEX作成手段は、複数文書における複数の音節又は音素（以下、「サブワード」という）から予め決められた所望数のサブワードを繋げた組み合わせで形成される音節列又は音素列（以下、「サブワード連鎖」という）を作成し、前記各サブワード連鎖間の距離に基づいてクラスタリングした複数のサブワード連鎖が属するグループを音節列クラスタ又は音素列クラスタ（以下、「サブワード連鎖クラスタ」という）として登録し、登録された複数のサブワード連鎖クラスタと前記検索対象文書との音素列同士を比較して類似度をそれぞれ算出し、前記各サブワード連鎖クラスタに対応させた類似度のテーブルを検索用INDEXとして予め作成する。 That is, the similar document search apparatus according to claim 1 includes an input interface for inputting a character string or a character string converted by speech recognition as a search character string, and a portion similar to the search character string from a plurality of search target documents. A search apparatus for searching for a search target document includes the following configuration.
INDEX creation means for search . The index creation means for search is a syllable string or phoneme string (hereinafter referred to as “subword”) formed by combining a predetermined number of subwords from a plurality of syllables or phonemes (hereinafter referred to as “subwords”) in a plurality of documents . A group including a plurality of subword chains clustered based on the distance between the subword chains and registering as a syllable string cluster or a phoneme string cluster (hereinafter referred to as “subword chain cluster”). The similarity is calculated by comparing the phoneme strings of the plurality of subword chain clusters and the search target document, and a similarity table corresponding to each subword chain cluster is created in advance as a search index.

サブワード連鎖は、複数のサブワードを繋げた組み合わせで作成されるが、一つのサブワードで構成されるものであってもよい。 The subword chain is created by combining a plurality of subwords , but may be composed of one subword .

サブワード連鎖抽出手段。サブワード連鎖抽出手段は、前記入力インタフェースに入力された検索文字列を構成する音節又は音素から前記所望の数を繋げた複数の音節列又は音素列をサブワード連鎖として抽出する。
類似度演算手段。類似度演算手段は、前記サブワード連鎖抽出手段により抽出した各サブワード連鎖について、各サブワード連鎖が属するサブワード連鎖クラスタに対応する各類似度を前記検索用INDEXから求め、前記各類似度の合計を前記各検索対象文書に対する類似度として演算する。
出力インタフェース。出力インタフェースは、前記類似度に基づいた文書データの検索結果を出力するものである。 Subword chain extraction means. The sub-word chain extracting means extracts a plurality of syllable strings or phoneme strings connecting the desired numbers from the syllables or phonemes constituting the search character string input to the input interface as sub-word chains.
Similarity calculation means. For each subword chain extracted by the subword chain extraction means, the similarity calculation means obtains each similarity corresponding to the subword chain cluster to which each subword chain belongs from the search INDEX, and calculates the total of each similarity Calculated as the similarity to the search target document.
Output interface. The output interface outputs a search result of document data based on the similarity.

前記類似度演算手段は、検索用INDEXから求めた前記各サブワード連鎖が属するサブワード連鎖クラスタに対応する類似度の合計が高い順で上位複数個の検索対象文書候補の絞り込みを行う第１パスINDEX検索と、前記上位複数個の検索対象文書候補を音素列に変換し、検索文字列の音素列との間で音素を単位とした文字列類似度計算を行って得られる類似度により出力順位を決める第２パス文字列類似度計算による検索とを実行する。
そして、前記第２パス文字列類似度計算による検索は、前記第１パスINDEX検索で検索した検索対象文書候補について上位からＭ個毎を一つのグループとする複数セッションに分割し、上位候補のセッション順に順次各セッションに属する検索対象文書候補について文字列類似度計算を行って類似度を算出し、セッション内における類似度の最小値と平均値に基づいて算出されるセッション類似度Ｄが所定の閾値αより小さい場合に、以降のセッションでの文字列類似度計算による類似度算出を打ち切り、その時点で算出された類似度により検索対象文書候補の出力順位を決める。 The similarity calculation means performs a first pass INDEX search for narrowing down a plurality of search target document candidates in descending order of the similarity corresponding to the subword chain cluster to which each of the subword chains found from the search INDEX belongs. The output ranking is determined based on the similarity obtained by converting the plurality of search target document candidates into a phoneme string and performing a string similarity calculation with the phoneme string of the retrieved character string as a unit. A search based on the second pass character string similarity calculation is executed .
The search by the second pass character string similarity calculation is performed by dividing the search target document candidates searched by the first pass INDEX search into a plurality of sessions each having a group of M from the top as a group. The character string similarity calculation is sequentially performed on the search target document candidates belonging to each session in order to calculate the similarity, and the session similarity D calculated based on the minimum value and the average value of the similarity in the session is a predetermined threshold value. If it is smaller than α, similarity calculation by character string similarity calculation in subsequent sessions is terminated, and the output order of search target document candidates is determined based on the similarity calculated at that time.

請求項２は、請求項１の類似文書検索装置において、前記セッション類似度Ｄ及び前記閾値αは、セッション内における類似度の最小値Ｄ(min)、平均値Ｄ(mean)に対して、
セッション類似度Ｄ＝Ｄ(min)／Ｄ(mean)、
閾値α＝０．０〜１．０
としたことを特徴としている。 In the similar document search device according to claim 1, the session similarity D and the threshold value α are set to a minimum value D (min) and an average value D (mean) of the similarity in the session.
Session similarity D = D (min) / D (mean),
Threshold value α = 0.0 to 1.0
It is characterized by that.

請求項３は、請求項２の類似文書検索装置において、セッション内の標準偏差値をＤ(s)とした場合、
セッション類似度Ｄ＝Ｄ（s）／（Ｄ(mean)‐Ｄ(min)）、
としたことを特徴としている。 Claim 3 is the similar document search apparatus according to claim 2, wherein the standard deviation value in the session is D (s).
Session similarity D = D (s) / (D (mean) −D (min)),
It is characterized by that.

請求項４は、請求項１の類似文書検索装置において、前記INDEX作成手段における類似度は、各サブワード連鎖クラスタを代表する音節列または音素列と検索対象文書をそれぞれ音素列に変換し、二つの音素列において音素を単位とした文字列類似度計算を行って得られた累積文字列類似度値に基づいて算出することを特徴としている。 According to a fourth aspect of the present invention, there is provided the similar document retrieval apparatus according to the first aspect, wherein the similarity in the INDEX creation means is obtained by converting a syllable string or a phoneme string representing each subword chain cluster and a search target document into a phoneme string. In the phoneme string, calculation is performed based on a cumulative character string similarity value obtained by performing character string similarity calculation in units of phonemes.

請求項５は、請求項４の類似文書検索装置において、前記音素列間の文字列類似度計算は、音素の置換コスト値、音素の挿入コスト値、又は、音素の脱落コスト値を音声認識実験による音素混同行列に基づいて計算することを特徴としている。 5. The similar document search apparatus according to claim 4, wherein the character string similarity calculation between the phoneme strings is performed by performing a speech recognition experiment using a phoneme replacement cost value, a phoneme insertion cost value, or a phoneme dropout cost value. The calculation is based on the phoneme confusion matrix.

請求項６は、請求項５の類似文書検索装置において、前記置換コスト値、挿入コスト値及び脱落コスト値は、経験値や発音辞書により決められる値であることを特徴としている。 According to a sixth aspect of the present invention, in the similar document search apparatus of the fifth aspect, the replacement cost value, the insertion cost value, and the dropout cost value are values determined by an experience value or a pronunciation dictionary.

請求項７は、請求項１の類似文書検索装置において、前記サブワード連鎖クラスタは、複数のサブワード連鎖が属するグループとなり、一つだけのサブワード連鎖であることを特徴としている。 According to a seventh aspect of the present invention, in the similar document search apparatus according to the first aspect, the subword chain cluster is a group to which a plurality of subword chains belong, and is only one subword chain.

請求項８は、請求項１の類似文書検索装置において、前記入力インタフェースは、歌詞の一部分をテキストデータで入力することを特徴としている。 According to an eighth aspect of the present invention, in the similar document search apparatus according to the first aspect, the input interface inputs a part of the lyrics as text data.

請求項９は、請求項１の類似文書検索装置において、前記入力インタフェースは、歌詞の一部分を音声で入力し、入力した音声を音声認識手法によってテキストデータに変換してから検索することを特徴としている。 9. The similar document search device according to claim 1, wherein the input interface inputs a part of the lyrics by voice, converts the input voice into text data by a voice recognition method, and searches. Yes.

請求項１０は、請求項１の類似文書検索装置において、前記出力インタフェースは、検索結果となる歌詞の楽曲名、アーティスト名、又は、歌詞全文が提示されることを特徴としている。 According to a tenth aspect of the present invention, in the similar document retrieval apparatus of the first aspect, the output interface presents a song name, an artist name, or a full text of a lyrics as a retrieval result.

請求項１１は、請求項１の類似文書検索装置において、前記出力インタフェースは、歌詞全文がリンクして表示され、表示する際に、クエリとマッチングした歌詞部分をハイライトすることを特徴としている。 In the similar document search device according to claim 11, the output interface displays the full text of the lyrics linked to each other, and highlights the lyric portion matched with the query when displayed.

請求項１２は、請求項１の類似文書検索装置において、前記サブワード連鎖は、複数のサブワードを繋げた組み合わせで作成されることを特徴としている。 According to a twelfth aspect of the present invention, in the similar document search apparatus according to the first aspect, the subword chain is created by a combination of a plurality of subwords .

請求項１３は、請求項１の類似文書検索装置において、前記サブワード連鎖は、一つのサブワードで構成されることを特徴としている。 According to a thirteenth aspect of the present invention, in the similar document search device according to the first aspect, the subword chain is composed of one subword .

請求項１４は、請求項１の類似文書検索装置において、前記文字列類似度計算は、二つの音素列において音素を単位とし、音素の挿入、削除、又は、置換のコストを考慮した端点フリーＤＰマッチング方法を用いることを特徴としている。 14. The similar document search apparatus according to claim 1, wherein the character string similarity calculation is performed by using a phoneme as a unit in two phoneme sequences, and an endpoint-free DP considering a cost of insertion, deletion, or replacement of phonemes. It is characterized by using a matching method.

請求項１５は、請求項１の類似文書検索装置において、前記文字列類似度計算は、二つの音素列の中で異なる音素の数に基づく類似度計算方法で行うことを特徴としている。 According to a fifteenth aspect of the present invention, in the similar document search apparatus according to the first aspect, the character string similarity calculation is performed by a similarity calculation method based on the number of different phonemes in two phoneme strings.

本発明によれば、検索用INDEXの作成段階において、各サブワード連鎖クラスタと検索対象文書との類似度を事前に計算しておくため、類似検索文字列の検索時における第１パスINDEX検索での絞り込み処理には類似度の計算が不要となる。したがって、検索文字列と検索対象文書との音素間で行われる計算量の多い文字列類似度計算は、第２パスで少量の検索対象候補に限定して行われるため、全体の計算量を減少させることで高速検索を実現することができる。 According to the present invention, since the similarity between each subword chain cluster and the search target document is calculated in advance at the time of creating the search INDEX, the first pass INDEX search at the time of searching for similar search character strings is performed. It is not necessary to calculate the similarity for the narrowing-down process. Therefore, since the calculation of the character string similarity with a large amount of calculation performed between the phonemes of the search character string and the search target document is limited to a small amount of search target candidates in the second pass, the total calculation amount is reduced. By doing so, high-speed search can be realized.

更に、第２パスの文字列類似度計算において、セッション内における類似度の最小値と平均値に基づいて算出されるセッション類似度Ｄが所定の閾値αより小さい場合に、類似度が小さく検索文字列と似ている検索対象文書候補が存在すると判断し、以降のセッションでの文字列類似度計算による類似度算出を打ち切るので、文字列類似度計算を行う検索文書候補の数を減少させることで、検索率を低下させることなく検出処理の速度向上を図ることができる。 Further, in the character string similarity calculation of the second pass, when the session similarity D calculated based on the minimum value and the average value of the similarity in the session is smaller than a predetermined threshold value α, the similarity is small and the search character Since it is determined that there are search target document candidates that are similar to the column and the similarity calculation by the character string similarity calculation in the subsequent session is terminated, the number of search document candidates that perform the character string similarity calculation can be reduced. Therefore, the speed of the detection process can be improved without reducing the search rate.

また、検索用INDEXの作成時に、検索文字列の構成部分となる可能性があるサブワード連鎖が含まれる音節列クラスタと検索対象文書の類似度を考慮するため、第１パスINDEX検索において正解となる検索対象文書を候補から外してしまう確率を低下させることができる。この結果、全検索対象に対する文字列類似度計算による検索と同程度の検索精度が実現できる。 Also, when creating a search INDEX, the first pass INDEX search is correct because the similarity between the search target document and the syllable string cluster including the subword chain that may be a constituent part of the search character string is considered. The probability that the search target document is excluded from the candidates can be reduced. As a result, it is possible to realize the same search accuracy as the search by the character string similarity calculation for all search targets.

また、検索用INDEX作成と検索文字列検索の際に、単語の代わりに、サブワード連鎖（音節列または音素列）を利用することによって、検索文字列と検索対象文書との間で対応する単語が存在しない（OUT-OF-VOCABULRY）という問題が生じることがなくなる。
さらに、クラスタリングによって検索用INDEXへのエントリ数が（音節列または音素列の数から音節列クラスタまたは音素列クラスタの数になったため）減るので、検索用INDEX作成のための音節列または音素列の種類が多い場合でも検索用INDEXサイズを削減することができる。 In addition, by using a subword chain (syllable string or phoneme string) instead of a word when creating a search INDEX and searching a search string, the corresponding word between the search string and the search target document can be obtained. The problem of non-existence (OUT-OF-VOCABULRY) will no longer occur.
Furthermore, the number of entries in the search INDEX is reduced by clustering (because the number of syllable strings or phoneme strings clusters is reduced from the number of syllable strings or phoneme strings), so the syllable string or phoneme string for creating the search INDEX is reduced. Even if there are many types, the index size for search can be reduced.

本発明の類似文書検索装置の実施形態の一例を示す機能ブロック図である。It is a functional block diagram which shows an example of embodiment of the similar document search apparatus of this invention. 本発明の類似文書検索装置における検索用INDEXファイル作成処理手順を示すフローチャートである。It is a flowchart which shows the INDEX file creation processing procedure for search in the similar document search apparatus of this invention. 検索用INDEX作成手段で得られた音節列クラスタと歌詞データの音節列との類似度を示したテーブル図である。It is a table figure which showed the similarity degree of the syllable string cluster obtained by the search index creation means and the syllable string of the lyrics data. 検索用音素列Ｓ１「ｔｏｇａｎａｉ」と、検索対象音素列Ｓ２「…ｋｏｔｏｗａｎａｎｉａ…」との端点フリーＤＰマッチングによるＤＰ距離の計算の手順を説明するための計算結果表である。It is a calculation result table for demonstrating the procedure of calculation of DP distance by the end point free DP matching of search phoneme sequence S1 "toganai" and search object phoneme sequence S2 "... kobotanaia ...". 本発明の類似文書検索装置における検索処理の一例を示すフローチャートである。It is a flowchart which shows an example of the search process in the similar document search apparatus of this invention. 図５のフローチャートの第１パスINDEX検索における処理の詳細を示す説明図である。FIG. 6 is an explanatory diagram illustrating details of processing in a first pass INDEX search of the flowchart of FIG. 5. 第１パスINDEX検索により各歌詞データに対する近似音響距離Ｒの値の低い順で歌詞データをランキングした検索結果表である。It is the search result table which ranked lyric data in order with the low value of approximate acoustic distance R with respect to each lyric data by the 1st pass INDEX search. 図５のフローチャートの第２パスＤＰマッチング検索における処理の詳細を示すフローチャートである。6 is a flowchart showing details of processing in the second path DP matching search of the flowchart of FIG. 5.

以下、本発明の類似文書検索装置の一実施形態について、図１のブロック図を参照しながら説明する。
類似文書検索装置の一例として、歌詞の一部のフレーズ等を検索文字列として入力し、歌詞データや楽曲情報を出力する歌詞検索装置を例に説明する。歌詞検索装置では、歌詞の構成要素となるサブワード連鎖（連続する一定数の音素または音節）を単位として検索文字列と登録されている歌詞データとの類似度を考慮する。サブワード連鎖（音節列又は音素列）は、サブワード（音節又は音素）から予め決められた所望数のサブワードを繋げた組み合わせで形成されるものである。
歌詞検索装置となる類似文書検索装置１は、楽曲の歌詞の一部のフレーズ等を検索文字列として入力する入力インタフェース２と、歌詞データ（文書データ）の検索結果を出力する出力インタフェース３と、検索対象となる複数の歌詞データ（検索対象文書）が格納された歌詞データベース４と、入力インタフェース２に入力された検索文字列から歌詞データベース４に格納されている歌詞データの検索処理を行う制御部１０を有して構成されている。入力インタフェース２に入力される検索文字列は、音声認識により変換された文字列であってもよい。 Hereinafter, an embodiment of a similar document search apparatus of the present invention will be described with reference to the block diagram of FIG.
As an example of the similar document search apparatus, a lyrics search apparatus that inputs a part of phrases of lyrics as a search character string and outputs lyrics data and music information will be described as an example. In the lyric search device, the similarity between the search character string and the registered lyric data is taken into consideration in units of subword chains (constant continuous phonemes or syllables) that are constituent elements of the lyrics. A subword chain (syllable string or phoneme string) is formed by combining a predetermined number of subwords connected in advance from subwords (syllables or phonemes).
A similar document search device 1 serving as a lyrics search device includes an input interface 2 for inputting a part of phrases of lyrics of a song as a search character string, an output interface 3 for outputting a search result of lyrics data (document data), A lyrics database 4 in which a plurality of lyrics data to be searched (documents to be searched) are stored, and a control unit that performs a search process of lyrics data stored in the lyrics database 4 from a search character string input to the input interface 2 10. The search character string input to the input interface 2 may be a character string converted by voice recognition.

より具体的には、本発明の類似文書検索装置は、中央演算処理部（ＣＰＵ），ハードディスク等の記憶部，ドライバ等の読取部，キーボード等の入力部，ディスプレイ等の表示部等を備えた汎用のパーソナルコンピュータ（ＰＣ）や携帯情報機器に専用プログラムをインストールすることで構成される。したがってＰＣにおいて、入力インタフェース２は入力部に、出力インタフェース３は表示部に該当し、歌詞データベース４は記憶部内に構築され、音節列クラスタ作成手段１１、検索用INDEX作成手段１２、サブワード連鎖抽出手段１３、類似度演算手段１４における各種処理は中央演算処理部（ＣＰＵ）で実行される。また、入力インタフェース２を携帯端末とし、検索処理をＰＣで行うようにしてもよい。 More specifically, the similar document retrieval apparatus of the present invention includes a central processing unit (CPU), a storage unit such as a hard disk, a reading unit such as a driver, an input unit such as a keyboard, a display unit such as a display, and the like. It is configured by installing a dedicated program on a general-purpose personal computer (PC) or portable information device. Therefore, in the PC, the input interface 2 corresponds to the input unit, the output interface 3 corresponds to the display unit, the lyric database 4 is constructed in the storage unit, and the syllable string cluster generation unit 11, the search INDEX generation unit 12, the subword chain extraction unit. 13. Various processes in the similarity calculation means 14 are executed by a central processing unit (CPU). Alternatively, the input interface 2 may be a portable terminal and the search process may be performed by a PC.

入力インタフェース２へは、例えば、歌詞の一部分を入力部からのテキストデータで入力される。また、入力インタフェースは、歌詞の一部分を音声で入力し、入力した音声を音声認識手法によってテキストデータに変換するようにしてもよい。 For example, a part of the lyrics is input to the input interface 2 as text data from the input unit. Further, the input interface may input a part of the lyrics by voice and convert the input voice into text data by a voice recognition method.

制御部１０は、後述する音節N-Gram間の距離に基づいてクラスタリングしてサブワード連鎖クラスタ（音節列クラスタ）を登録する音節列クラスタ作成手段１１と、サブワード連鎖クラスタ（音節列クラスタ）と歌詞データベース４の各歌詞データ（検索対象文書）の音素列との類似度をテーブルとした検索用INDEXを予め作成する検索用INDEX作成手段１２と、検索文字列から複数のサブワード連鎖として抽出するサブワード連鎖抽出手段１３と、検索文字列に対する各歌詞データ（検索対象文書）の類似度を演算する類似度演算手段１４を備えている。 The control unit 10 performs clustering based on the distance between syllable N-Gram described later, and creates a syllable string cluster creating means 11 for registering a subword chain cluster (syllable string cluster), a subword chain cluster (syllable string cluster), and a lyrics database. INDEX creation means 12 for creating a search INDEX in advance with a table of similarities to the phoneme string of each lyric data (search target document) 4 and subword chain extraction for extracting a plurality of subword chains from the search character string Means 13 and similarity calculating means 14 for calculating the similarity of each lyric data (search target document) with respect to the search character string are provided.

音節列クラスタ作成手段１１は、大量の文書における多数の音節から予め設定された個数（Ｎ個）の音節を組み合わせることでサブワード連鎖として複数の音節N-Gramを作成し、各音節N-Gram間の距離に基づいてクラスタリングした複数の音節N-Gramが属するグループをサブワード連鎖クラスタ（音節列クラスタ）として登録する。
サブワード連鎖クラスタは、複数のサブワード連鎖が属するグループとなり、一つだけのサブワード連鎖であってもよい。
なお、音節の代わりに音素を使用してもよい。この場合、音節列クラスタ作成手段１１は、音素列クラスタ作成手段１１となり、予め設定された個数（Ｎ個）の音素を組み合わせることで複数の音素N-Gramを作成し、各音素N-Gram間の距離に基づいてクラスタリングした複数の音素N-Gramが属するグループを音素列クラスタとして登録すればよい。
各音節N-Gram間（各音素N-Gram間）の距離は、各音節N-Gramを構成するＮ個の音節を構成する音素同士（音素の数が異なる場合もある）の比較で行う文字列類似度計算（ＤＰマッチング）で算出する。ＤＰマッチングについては後述する。 The syllable string cluster creating means 11 creates a plurality of syllable N-Grams as a subword chain by combining a predetermined number (N) of syllables from a large number of syllables in a large number of documents, and between each syllable N-Gram. A group to which a plurality of syllable N-Grams clustered on the basis of the distance is registered as a subword chain cluster (syllable string cluster).
A subword chain cluster is a group to which a plurality of subword chains belong, and may be a single subword chain.
Note that phonemes may be used instead of syllables. In this case, the syllable string cluster creating means 11 becomes the phoneme string cluster creating means 11 and creates a plurality of phoneme N-Grams by combining a preset number (N) of phonemes, and between each phoneme N-Gram. A group to which a plurality of phonemes N-Gram clustered on the basis of the distances may be registered as a phoneme string cluster.
The distance between each syllable N-Gram (between each phoneme N-Gram) is a character that is determined by comparing the phonemes that make up the N syllables that make up each syllable N-Gram (the number of phonemes may be different) It is calculated by column similarity calculation (DP matching). The DP matching will be described later.

検索用INDEX作成手段１２は、登録された複数の音節列クラスタと歌詞データベース４の各歌詞データの音素列との距離をテーブルとした検索用INDEXを予め作成する。各音節列クラスタは、それぞれ複数の音節N-Gramを代表する音節N-Gramである。検索用INDEXには、予め算出された各音節列クラスタに対する各歌詞データの距離が記憶されている。検索用INDEXの作成手順については後述する。 The search index creation means 12 creates in advance a search index using a table of distances between a plurality of registered syllable string clusters and phoneme strings of each lyrics data in the lyrics database 4. Each syllable string cluster is a syllable N-Gram representing a plurality of syllable N-Grams. In the search INDEX, the distance of each lyric data with respect to each syllable string cluster calculated in advance is stored. The procedure for creating the search INDEX will be described later.

サブワード連鎖抽出手段１３は、入力された検索文字列を、Ｎ個の音節毎に分けて複数のサブワード連鎖として抽出する。例えば、検索文字列が８個の音節で構成され、Ｎを３個とした場合、それぞれ先頭から１，２，３番目の音節、２，３，４番目の音節、３，４，５番目の音節、４，５，６番目の音節、５，６，７番目の音節、６，７，８番目の音節（音節3-Gram）がサブワード連鎖として抽出される。
前記音素列クラスタ作成手段１１に応じて、サブワード連鎖抽出手段１３では、音節の代わりに音素を使用し、検索文字列をＮ個の音節毎に分けて複数のサブワード連鎖（音素N-Gram）として抽出する。
サブワード連鎖（音節列又は音素列）は、複数のサブワード（音節又は音素）を繋げた組み合わせで作成されるが、一つのサブワード（音節又は音素）で構成されるものであってもよい。 The subword chain extraction means 13 extracts the input search character string as a plurality of subword chains divided into N syllables. For example, if the search string is composed of 8 syllables and N is 3, the first, second, third syllable, second, third, fourth syllable, third, fourth, fifth The syllable, the fourth, fifth and sixth syllables, the fifth, sixth and seventh syllables, and the sixth, seventh and eighth syllables (syllable 3-Gram) are extracted as subword chains.
In response to the phoneme string cluster creation means 11, the subword chain extraction means 13 uses phonemes instead of syllables, and divides the search character string into N syllables as a plurality of subword chains (phoneme N-Gram). Extract.
A subword chain (syllable string or phoneme string) is created by combining a plurality of subwords (syllable or phoneme), but may be composed of a single subword (syllable or phoneme).

類似度演算手段１４による検索文書は、検索用INDEXを参照して検索文字列に対する歌詞データ候補を絞り込む第１パスINDEX検索と、この歌詞データ候補と検索文字列との音素列同士のＤＰマッチングを行って最終的な歌詞データの検索を行う第２パスＤＰマッチング検索とにより行われる。
第１パスINDEX検索は、前記した検索用INDEXを参照して抽出した各サブワード連鎖（音節3-Gram）に対応するサブワード連鎖クラスタ（音節列クラスタ）の距離の合計から各歌詞データの音素列との類似度を演算することで行われる。すなわち、各サブワード連鎖が属するクラスタに対応する各距離を検索用INDEXから求め、各距離の合計を各検索対象文書（各歌詞データの音素列）に対する類似度として演算する。
第２パスＤＰマッチング検索は、歌詞データ候補と検索文字列との音素列同士のＤＰマッチングにより類似度を演算して最終的な歌詞データの検索を行う。 The search document by the similarity calculation means 14 refers to the first pass INDEX search that narrows down the lyrics data candidates for the search character string with reference to the search INDEX, and DP matching between the phoneme strings of the lyrics data candidate and the search character string. This is performed by a second pass DP matching search that is performed to search for final lyrics data.
In the first pass INDEX search, the phoneme string of each lyric data is calculated from the sum of the distances of the subword chain clusters (syllable string clusters) corresponding to each subword chain (syllable 3-Gram) extracted with reference to the above-described search index. This is done by calculating the similarity of. That is, each distance corresponding to the cluster to which each subword chain belongs is obtained from the search INDEX, and the total of each distance is calculated as the similarity to each search target document (phoneme string of each lyric data).
In the second pass DP matching search, the final lyrics data is searched by calculating the similarity by DP matching between phoneme strings of the lyrics data candidate and the search character string.

出力インタフェース３は、類似度演算手段１４での演算結果に基づいて歌詞データ（検索対象文書）の検索結果を出力する。検索結果は、入力インタフェース２に入力される検索文字列との類似度の高い順で出力される。 The output interface 3 outputs the search result of the lyrics data (search target document) based on the calculation result in the similarity calculation means 14. Search results are output in descending order of similarity to the search character string input to the input interface 2.

また、出力インタフェースでは、検索結果となる歌詞の楽曲名、アーティスト名、歌詞全文が提示されるようにしてもよい。あるいは、出力インタフェースは、歌詞全文がリンクして表示され、表示する際に、クエリとマッチングした歌詞部分について、異なる色で表示したり、大きな文字で表記したりする等で強調するようなハイライト処理を行うようにしてもよい。 In addition, the output interface may present the song name, artist name, and full lyrics of the lyrics that are the search results. Alternatively, the output interface highlights the entire lyrics that are linked and displayed, and highlights the lyrics that match the query by displaying them in different colors or in large letters. Processing may be performed.

次に、第１パスINDEX検索における検索文字列に対する歌詞データの検索時に使用する検索用INDEXの作成手順について、図２を参照しながら説明する。検索用INDEXは、音節列クラスタ作成手段１１及び検索用INDEX作成手段１２に対して予め入力される情報により作成される。 Next, a procedure for creating a search INDEX used when searching lyrics data for a search character string in the first pass INDEX search will be described with reference to FIG. The search INDEX is created from information input in advance to the syllable string cluster creation means 11 and the search INDEX creation means 12.

先ず、歌詞データや新聞コーパス等の大量の文書データをサブワード連鎖抽出手段１３に入力し、所望の数Ｎを指定して形態素解析を行って音節列に変換することで、文書データに含まれる音節Ｎ個から構成された全ての音節列を音節N-GramＡ（ｉ）として算出する。
例えば、図２に示すように、文書データに文字列「好きな事がない」が存在しサブワード連鎖抽出手段１３に入力された場合（ステップ２１）、音節変換処理を施すことで音節列「su-ki-na-ko-to-ga-na-i」を求め（ステップ２２）、所望の数Ｎを３とした場合、これを３音節で構成される複数の音節N-Gram（音節3-Gram）Ａ（ｍ）に変換する。音節N-GramＡ（1〜6）は、それぞれ「su-ki-na」「ki-na-ko」「na-ko-to」「ko-to-ga」「to-ga-na」「ga-na-i」に変換される（ステップ２３）。 First, a large amount of document data such as lyric data and newspaper corpus is input to the sub-word chain extraction means 13, morphological analysis is performed by designating a desired number N, and converted into a syllable string, whereby syllables included in the document data All syllable strings composed of N are calculated as syllable N-GramA (i).
For example, as shown in FIG. 2, when the character string “There is nothing you like” exists in the document data and is input to the subword chain extraction means 13 (step 21), the syllable string “su” is obtained by performing syllable conversion processing. -ki-na-ko-to-ga-na-i "(step 22). If the desired number N is 3, a plurality of syllables N-Gram (syllable 3- Gram) Convert to A (m). Syllable N-GramA (1-6) are "su-ki-na", "ki-na-ko", "na-ko-to", "ko-to-ga", "to-ga-na", "ga-" na-i "(step 23).

次に、作成された全ての音節N-GramＡ（ｍ）について、音節N-Gram同士の距離行列を計算する（ステップ２４）。音節N-Gram同士間の距離の計算ついては、二つの音節N-Gramをそれぞれ音素列に変換し、二つの音素列間の音響距離を求めることになる。音素列間の音響距離の計算の仕方については、二つの音素列において、音素を単位とするＤＰマッチング（系列になっているデータ同士の類似度の計算）を行う。 Next, a distance matrix between syllable N-Grams is calculated for all the created syllable N-Gram A (m) (step 24). Regarding the calculation of the distance between the syllable N-Grams, the two syllable N-Grams are converted into phoneme strings, respectively, and the acoustic distance between the two phoneme strings is obtained. As for the method of calculating the acoustic distance between phoneme strings, DP matching (calculation of similarity between series of data) is performed in units of phonemes in two phoneme strings.

ＤＰマッチングとは、既存の技術である動的計画法による文字列類似度計算であり、二つの音素列間の音響類似距離値を求めるに際し、音素間の差違や音素の挿入や脱落を考慮した計算手段により計算する。ＤＰマッチングの具体的な計算例については後述する。 DP matching is a string similarity calculation based on dynamic programming, which is an existing technology. When calculating the acoustic similarity distance between two phoneme strings, the difference between phonemes and the insertion or deletion of phonemes are taken into account. Calculate by calculation means. A specific calculation example of DP matching will be described later.

図２のフローチャートに戻り、ＤＰマッチングにより各音節間の音響距離に基づいてクラスタリングした複数の音節N-Gramが属するグループを音節列クラスタＣ（ｎ）として登録する（ステップ２５）。
複数の音節N-GramＡ（ｍ）のクラスタリングは、次のような手順で行われる。
音節N-Gram同士の距離行列で構成される各音響距離値の計算（音節N-Gram同士間の距離の計算）については、上述したように、二つの音節N-Gramをそれぞれ音素列に変換し、二つの音素列間の音響距離を求める。
そして、各音節N-Gramを音響距離値空間上にプロットし、距離の近い音響距離値同士で空間をクラスタリングして、各クラスタの代表ベクトルとなる音節列クラスタＣ（ｎ）を算出する。音節列クラスタＣ（ｎ）は、複数（クラスタにより異なる一定でない数）の音節N-GramＡ（ｍ）を代表する。 Returning to the flowchart of FIG. 2, a group to which a plurality of syllable N-Grams clustered based on the acoustic distance between syllables by DP matching belongs is registered as a syllable string cluster C (n) (step 25).
Clustering of a plurality of syllables N-GramA (m) is performed in the following procedure.
As for the calculation of each acoustic distance value composed of the distance matrix between syllable N-Grams (calculation of the distance between syllable N-Grams), each syllable N-Gram is converted into a phoneme string as described above. Then, the acoustic distance between the two phoneme strings is obtained.
Then, each syllable N-Gram is plotted on the acoustic distance value space, and the space is clustered with the acoustic distance values that are close to each other to calculate a syllable string cluster C (n) that is a representative vector of each cluster. The syllable string cluster C (n) represents a plurality of (non-constant numbers different depending on the clusters) syllable N-GramA (m).

クラスタリングの具体的な実現手段としては、公知技術であるk-meansを用いることができる。k-means によるクラスタリングは、以下の（１）〜（４）の手順により行われる。
（１）データを指定された任意の数であるk個のクラスタに分割する。例えば、音節列クラスタＣ（ｎ）の個数ｎを１０００個とする。
（２）各クラスタについて重心を計算する。
（３）全てのデータについて、重心との距離を最小にするクラスタを求め、各データを最小のクラスタに割り当てる。
（４）前回のクラスタから変化がなければ終了する。変化がある場合は、（２）に戻る。 As a specific means for realizing clustering, k-means, which is a known technique, can be used. Clustering by k-means is performed by the following procedures (1) to (4).
(1) The data is divided into k clusters, which are an arbitrary number. For example, the number n of syllable string clusters C (n) is 1000.
(2) Calculate the centroid for each cluster.
(3) A cluster that minimizes the distance from the center of gravity is obtained for all data, and each data is assigned to the smallest cluster.
(4) End if there is no change from the previous cluster. If there is a change, return to (2).

上述した例では、文書データをサブワード連鎖抽出手段１３に入力して音節列に変換することで音節N-Gram（音節3-Gram）Ａ（ｍ）を作成したが、文書データを入力することなく得ることもできる。すなわち、音節列クラスタ作成手段１１において、５０音及び「しゃ」「じゃ」等の濁音や促音を含んだ多数の音素（最大で１２０〜１３０程度存在する）から所望（例えば３音）の数（Ｎ個）の音素を繋げた全ての音節N-Gram（音節3-Gram）Ａ（ｍ）を作成すればよい。この場合、例えば、「ア」「イ」「ウ」や「シャ」「カ」「イ」のような３音の音素の連結を単位として全ての音節N-GramＡ（ｍ）を作成する。音素として１００音を選択すれば、１００の３乗個の音節N-Gram（音節3-Gram）Ａ（ｍ）が存在することになる。 In the above example, the syllable N-Gram (syllable 3-Gram) A (m) is created by inputting the document data to the subword chain extraction means 13 and converting it into a syllable string. However, without inputting the document data. It can also be obtained. That is, in the syllable string cluster creating means 11, a desired number (for example, three sounds) from a large number of phonemes (including about 120 to 130 at the maximum) including 50 sounds and muddy sounds such as “sha” and “ja” and prompt sounds (for example, 3 sounds). All syllables N-Gram (syllable 3-Gram) A (m) connecting N phonemes may be created. In this case, for example, all syllables N-GramA (m) are created with a unit of three phonemes such as “A”, “I”, “U”, “Sha”, “K”, and “I”. If 100 sounds are selected as phonemes, there are 100 cubed syllables N-Gram (syllable 3-Gram) A (m).

音節N-Gram同士間の距離に基づいたk-means法を用いてクラスタリングすることで、複数の音節N-Gramを代表する音節N-Gramが音節列クラスタＣ（ｎ）として自動的に分類され、この音節列クラスタＣ（ｎ）と歌詞データベース４に登録された検索対象文書である複数の歌詞データＬ（ｋ）の音素列との各音響距離値を算出し、図３に示すような検索用INDEXのテーブル３０を予め作成しておく。 By clustering using the k-means method based on the distance between syllable N-Grams, syllable N-Grams representing multiple syllable N-Grams are automatically classified as syllable string clusters C (n). Then, each acoustic distance value between the syllable string cluster C (n) and the phoneme strings of the plurality of lyrics data L (k) which are search target documents registered in the lyrics database 4 is calculated, and the search as shown in FIG. The INDEX table 30 is created in advance.

音節列クラスタＣ（ｎ）と歌詞データＬ（ｋ）の音素列との類似度は、ＤＰマッチングで算出する。この場合、音節列クラスタＣ（ｎ）の音素数に対して、歌詞データＬ（ｋ）の音素数が多くなるので、二つの音素列の長さが異なるため、端点フリーＤＰマッチングを利用する。端点フリーＤＰマッチングを使用することで、音節列クラスタＣ（ｎ）を構成する音素の連続と歌詞データの最も長い共通部分列を算出し、その共通部分の類似度を算出できる。すなわち、端点フリーＤＰマッチングは、二つの音素列において音素を単位とし、音素の挿入や削除、置換のコストを考慮して文字列類似度計算が行われる。また、文字列類似度計算において、二つの音素列の中で異なる音素の数（置換された音素のみ）に基づいて類似度計算を行ってもよい。 The similarity between the syllable string cluster C (n) and the phoneme string of the lyrics data L (k) is calculated by DP matching. In this case, since the number of phonemes in the lyric data L (k) is larger than the number of phonemes in the syllable string cluster C (n), the lengths of the two phoneme strings are different, so that end point free DP matching is used. By using the end point free DP matching, it is possible to calculate the longest common subsequence of the phoneme series and the lyric data constituting the syllable sequence cluster C (n), and the similarity of the common portion can be calculated. That is, in the end point free DP matching, character string similarity is calculated in consideration of the cost of insertion, deletion, and replacement of phonemes in units of phonemes in two phoneme strings. In the character string similarity calculation, the similarity calculation may be performed based on the number of different phonemes in the two phoneme strings (only the replaced phonemes).

端点フリーＤＰマッチングによる具体的な計算方法について説明する。二つの音素列Ｓ１，Ｓ２が存在し、ＤＰマッチング計算における挿入、脱落、置換のコスト値をそれぞれＣins，Ｃdel,Ｃsubとした場合、ＤＰマッチング計算式の初期値は数１、計算値は数２、結果値は数３でそれぞれ表わされる。 A specific calculation method by end point free DP matching will be described. If there are two phoneme sequences S1 and S2, and the cost values for insertion, omission, and replacement in DP matching calculation are Cins, Cdel, and Csub, respectively, the initial value of the DP matching calculation formula is Equation 1, and the calculation value is Equation 2 The result value is expressed by Equation (3).

検索用音素列Ｓ１が「ｔｏｇａｎａｉ」であり、検索対象音素列Ｓ２が「…ｋｏｔｏｗａｎａｎｉａ…」である場合を例に、ＤＰ距離の計算処理結果を図４に示す。この場合において、計算する際の単純化を図るため、Ｃins＝１，Ｃdel＝１,Ｃsub＝１とした。実際には、後述する音素混同行列によって各音素に対してＣins，Ｃdel,Ｃsubのコスト値が算出される。 FIG. 4 shows the DP distance calculation processing result, taking as an example the case where the search phoneme string S1 is “toganai” and the search target phoneme string S2 is “... In this case, in order to simplify the calculation, Cins = 1, Cdel = 1, and Csub = 1. Actually, cost values of Cins, Cdel, and Csub are calculated for each phoneme by a phoneme confusion matrix described later.

先ず、数１の初期化計算式により検索対象音素列Ｓ２の各値の初期化を行う（STEP１）。次に、検索用音素列Ｓ１の「ｔ」に対する検索対象音素列Ｓ２の各要素の計算値を数２により計算する（STEP２）。数２では、４つの式のうちの最小値が選択される。
この計算を、検索用音素列Ｓ１の「ｏ」「ｇ」「ａ」「ｎ」「ａ」「ｉ」に対して順次行い（STEP３〜８）、全てのＤ（i.j）が算出される。 First, each value of the search target phoneme string S2 is initialized by the initialization calculation formula 1 (STEP 1). Next, the calculated value of each element of the search target phoneme string S2 with respect to “t” of the search phoneme string S1 is calculated by Equation 2 (STEP 2). In Equation 2, the minimum value among the four equations is selected.
This calculation is sequentially performed on “o”, “g”, “a”, “n”, “a”, and “i” of the search phoneme string S1 (STEPs 3 to 8), and all D (ij) are calculated.

Ｄ（len(S1),j）の最小値を終点とし、終点からバックトレースすることで、Ｄ（i,j）を最小値（最短距離）となる最適経路を導き出す（STEP９）。
その結果、検索用音素列Ｓ１「ｔｏｇａｎａｉ」は、検索対象音素列Ｓ２「…ｋｏｔｏｗａｎａｎｉａ…」の「ｔｏｗａｎａ」と類似し、ＤＰ距離は「２」となる。 By using the minimum value of D (len (S1), j) as the end point and back-tracing from the end point, an optimum route with D (i, j) having the minimum value (shortest distance) is derived (STEP 9).
As a result, the search phoneme string S1 “toganai” is similar to “towana” of the search target phoneme string S2 “...

検索用INDEXのテーブル３０では、図３に示すように、算出された各音節列クラスタを縦に配置し、各歌詞データを横に配置し、予め計算されたクラスタ音節と歌詞データの音素列同士の類似度が対応箇所に記載された表で構成されている。 In the search index table 30, as shown in FIG. 3, the calculated syllable string clusters are arranged vertically, the lyrics data are arranged horizontally, and the pre-calculated cluster syllable and the phoneme strings of the lyric data are exchanged. The degree of similarity is composed of a table described in the corresponding part.

ＤＰマッチングの計算においては、音素の置換コスト値（異なる音素間の置換コスト値）、音素挿入コスト値、音素脱落コスト値をそれぞれ予め数値を与え、音素列間の類似度を表す累積ＤＰマッチングコストを求める。この累積ＤＰマッチングコストを音素列間の音響距離値とする。上述した音素の置換、挿入及び脱落コストの数値については、音声認識実験による音素混同行列（phonetic confusion matrix）に基づいて計算される。
前記挿入コスト値及び脱落コスト値は、音声認識実験における経験値や発音辞書により決められる値である。 In the DP matching calculation, the replacement cost value of phonemes (replacement cost value between different phonemes), the phoneme insertion cost value, and the phoneme dropout cost value are given in advance, and the accumulated DP matching cost representing the similarity between phoneme strings. Ask for. This accumulated DP matching cost is set as an acoustic distance value between phoneme strings. The above phoneme replacement, insertion, and dropout values are calculated based on a phonetic confusion matrix from a speech recognition experiment.
The insertion cost value and dropout cost value are values determined by experience values and pronunciation dictionaries in speech recognition experiments.

音素混同行列とは、音声認識実験などにより求められるもので、行列の要素を確率で表したものである。音声認識実験は、音声データを学習データとして大量に集め、ＨＭＭ学習アリゴリズムによって学習データを利用して音声認識用の音素の音響モデルを作成する。 The phoneme confusion matrix is obtained by a speech recognition experiment or the like, and represents a matrix element with a probability. In the speech recognition experiment, a large amount of speech data is collected as learning data, and an acoustic model of phonemes for speech recognition is created using the learning data by an HMM learning algorithm.

また、音素間混同行列を使用した各音素の挿入、脱落及び他の音素との置換コストの算出例について、表１を参照して説明する。この場合、テスト用音声データを音声認識器に入力し、認識結果となった音素列を出力し、テスト用音声データを書き起こした正解音素列と前記認識結果となった音素列を比較することで音素間混同行列が作成される。
表１は、ａ，ｂ…ｚの各音素についての置換、脱落、挿入の回数を示すもので、ｎ（ｐ,ｋ）は音声認識器によって音素ｐを音素ｋとして認識した回数、ｎ（φ,ｐ）は音声認識器によって音素ｐが音声認識結果に挿入された（正解音素列には存在しない）回数、ｎ（ｐ,φ）は音声認識器によって音素ｐが脱落した回数をそれぞれ示している。 Further, an example of calculating the insertion cost of each phoneme using the interphoneme confusion matrix and the replacement cost with other phonemes will be described with reference to Table 1. In this case, the test speech data is input to the speech recognizer, the phoneme sequence that is the recognition result is output, and the correct phoneme sequence that transcribes the test speech data is compared with the phoneme sequence that is the recognition result. Creates a phoneme confusion matrix.
Table 1 shows the number of substitutions, omissions, and insertions for each phoneme of a, b... Z, where n (p, k) is the number of times the phoneme p is recognized as the phoneme k by the speech recognizer, n (φ , p) indicates the number of times the phoneme p has been inserted into the speech recognition result by the speech recognizer (does not exist in the correct phoneme string), and n (p, φ) indicates the number of times the phoneme p has been dropped by the speech recognizer. Yes.

ＤＰマッチング計算における挿入、脱落、置換のコスト値（Ｃins，Ｃdel,Ｃsub）は、Ｍを音素の集合とした場合、数４〜６により求められる。 The cost values (Cins, Cdel, Csub) for insertion, omission, and substitution in DP matching calculation are obtained by Equations 4 to 6, where M is a set of phonemes.

次に、検索文字列を入力して類似する歌詞データを検索する場合について、図５のフローチャートを参照して説明する。
先ず、ユーザが歌詞の一部であるとして記憶しているフレーズ（聞き覚えのある単語の集合である文字列。例えば「好きな事がない」）を検索文字列として入力インタフェース２から入力する（ステップ４１）。前記したフレーズ「好きな事がない」は、実際の歌詞に含まれている音節と完全に一致していなくてもよい。 Next, the case of searching for similar lyrics data by inputting a search character string will be described with reference to the flowchart of FIG.
First, a phrase (a character string that is a set of familiar words. For example, “nothing you like”) that the user has stored as part of the lyrics is input as a search character string from the input interface 2 (steps). 41). The phrase “I don't like it” does not have to completely match the syllable included in the actual lyrics.

入力されたフレーズは、サブワード連鎖抽出手段１３により、フレーズを構成する文字から音節変換処理が行われる（ステップ４２）。具体的には、フレーズ「好きな事がない」を音節列である「su-ki-na-ko-to-ga-na-i」に変換する。 The input phrase is subjected to syllable conversion processing from the characters constituting the phrase by the subword chain extraction means 13 (step 42). Specifically, the phrase “I don't like it” is converted into a syllable string “su-ki-na-ko-to-ga-na-i”.

続いて、指定された数Ｎ（この例では「３」）の音節数から構成される複数の音節N-Gram（音節3-Gram）Ｑに変換する（ステップ４３）。
音節N-GramＱは、上記例の音節列「su-ki-na-ko-to-ga-na-i」に対して、Ｑ（１）「su-ki-na」，Ｑ（２）「ki-na-ko」，Ｑ（３）「na-ko-to」，Ｑ（４）「ko-to-ga」，Ｑ（５）「to-ga-na」，Ｑ（６）「ga-na-i」の６個に変換される。 Subsequently, it is converted into a plurality of syllables N-Gram (syllable 3-Gram) Q composed of a specified number N (“3” in this example) (step 43).
The syllable N-GramQ is Q (1) “su-ki-na”, Q (2) “ki” for the syllable string “su-ki-na-ko-to-ga-na-i” in the above example. -na-ko ", Q (3)" na-ko-to ", Q (4)" ko-to-ga ", Q (5)" to-ga-na ", Q (6)" ga-na -i "is converted into six.

次に、歌詞データＬ（ｋ）に対する類似度Ｒ（ｋ）を検索用INDEXのテーブル３０を使用し、値が小さい上位複数個（Ｐ個）の歌詞データを候補として選択する第１パスINDEX検索が行われる（ステップ４４）。
この検索は、図６に示すように、検索文字列のフレーズに対応する各音節N-GramＱが含まれる音節列クラスタＣと、歌詞データの音節列との距離の総和である類似度Ｒ（ｋ）＝Ｄ（35,k）＋…＋Ｄ（n,k）＋…＋Ｄ（1000,k））をテーブルに記憶された値に基づいて計算することで求める。 Next, the first pass INDEX search for selecting a plurality of (P) lyric data having a smaller value as candidates using the index 30 for searching for the similarity R (k) to the lyric data L (k). Is performed (step 44).
As shown in FIG. 6, this search is performed by using the similarity R (k) that is the sum of the distances between the syllable string cluster C including each syllable N-GramQ corresponding to the phrase of the search character string and the syllable string of the lyrics data. ) = D (35, k) +... + D (n, k) +... + D (1000, k)) is calculated based on the values stored in the table.

具体的は、Ｑ（１）「su-ki-na」が含まれる音節列クラスタがＣ（35）であり、Ｑ（ｍ）「na-ko-to」が含まれる音節列クラスタがＣ（n）であり、Ｑ（６）「ga-na-i」が含まれる音節列クラスタがＣ（1000）である場合、これを加算した値が類似度Ｒ（ｋ）となる。
そして、第１パスINDEX検索結果として、図７に示すように、各歌詞データに対する類似度Ｒの値の低い順で歌詞データをランキングし、その上位Ｐ個（例えば８００個）の歌詞データＬ（ｋ）を検索対象候補として絞り込む。歌詞データの絞り込み個数であるＰは、可変にできるようになっている。 Specifically, the syllable string cluster including Q (1) “su-ki-na” is C (35), and the syllable string cluster including Q (m) “na-ko-to” is C (n). ) And the syllable string cluster including Q (6) “ga-na-i” is C (1000), the value obtained by adding these is the similarity R (k).
Then, as a first pass INDEX search result, as shown in FIG. 7, the lyric data is ranked in descending order of the similarity R value for each lyric data, and the top P (for example, 800) lyric data L ( k) is narrowed down as a search target candidate. P, which is the number of lyrics data to be narrowed, can be made variable.

続いて、検索文字列の音素列と歌詞データの音素列との音響距離が最も小さい歌詞データを決めるため第２パスＤＰマッチング検索が行われる（ステップ４５）。この検索は、検索文字列の音素列と、第１パスINDEX検索によって検索対象候補となった歌詞データの音素列との類似度Ｒ´をＤＰマッチングで算出する。そして、ＤＰマッチングを行うに際しては、第１パスINDEX検索によって検索対象候補となったＰ（８００）個の歌詞データについて、上位からＭ（１０〜１００）個毎を一つのグループとする複数（Ｐ／Ｍ個の）セッションに分割し、セッション毎に各セッションに属する歌詞データに対してＤＰマッチングを行う。
以下、第２パスＤＰマッチング検索の手順について、図８のフローチャートを参照しながら説明する。 Subsequently, a second pass DP matching search is performed to determine the lyric data having the shortest acoustic distance between the phoneme string of the search character string and the phoneme string of the lyric data (step 45). In this search, the similarity R ′ between the phoneme string of the search character string and the phoneme string of the lyric data that is the search target candidate by the first pass INDEX search is calculated by DP matching. When performing DP matching, a plurality (P) of P (800) lyric data that are candidates for search by the first pass INDEX search are grouped by M (10 to 100) from the top. / M sessions), and DP matching is performed on the lyrics data belonging to each session for each session.
Hereinafter, the procedure of the second path DP matching search will be described with reference to the flowchart of FIG.

先ず、第１パスINDEX検索によって検索対象候補となったＰ（８００）個の歌詞データについて、Ｍ（１０〜１００）個を一つのグループとする複数（Ｐ／Ｍ）セッションに分割する（ステップ５１）。 First, P (800) lyric data that are candidates for search by the first pass INDEX search are divided into a plurality of (P / M) sessions with M (10 to 100) as one group (step 51). ).

続いて、上位候補のセッション順に順次各セッションに属する検索対象文書候補（歌詞データ）についてＤＰマッチングによる類似度の算出が行われるが、先ずセッション１に属するＭ個の各歌詞データＬ（ｋ）に対してそれぞれＤＰマッチングを行ってＭ個の各類似度を算出する（ステップ５２）。
そして、セッション内での全ての歌詞データに対するＤＰマッチングの計算終了時に、セッション内における類似度の最小値（Ｄ(min)）とＭ個の類似度の平均値（Ｄ(mean)）から、セッション類似度Ｄ＝Ｄ(min)／Ｄ(mean)を計算する（ステップ５３）。 Subsequently, similarities are calculated by DP matching for search target document candidates (lyric data) belonging to each session in order of the top candidate session. First, the M pieces of lyrics data L (k) belonging to session 1 are calculated. Then, DP matching is performed for each of them to calculate M similarities (step 52).
Then, at the end of the DP matching calculation for all the lyrics data in the session, from the minimum value (D (min)) of similarity in the session and the average value (D (mean)) of M similarities, Similarity D = D (min) / D (mean) is calculated (step 53).

次に、セッション類似度Ｄが所定の閾値α（０．４〜０．５）より小さいかどうかを判断する（ステップ５４）。
セッション類似度Ｄが所定の閾値αより小さい場合には、類似度が小さく検索文字列と似ている歌詞データがセッション内に存在すると判断し、以降のセッションでのＤＰマッチングによる類似度算出を打ち切り、その時点で算出された類似度により検索対象文書候補の出力順位を決め、上位Ｘ個を検索結果として出力する（ステップ５５）。
すなわち、ＤＰマッチング計算済のセッションにおいて算出された音響距離値（類似度）Ｒ´が低いＸ個について、歌詞データ及びそれらの楽曲情報を検索結果として出力インタフェース３に出力する。出力するＸ個は、ＤＰマッチング計算済である少なくとも１以上のセッションにおいて算出された全ての類似度に対応する歌詞データ及びそれらの楽曲情報でもよいし（その場合、ＸはＭの倍数となる）、その中の上位のものであってもよい。 Next, it is determined whether or not the session similarity D is smaller than a predetermined threshold value α (0.4 to 0.5) (step 54).
If the session similarity D is smaller than the predetermined threshold value α, it is determined that there is lyrics data in the session that has a low similarity and is similar to the search character string, and the similarity calculation by DP matching in the subsequent sessions is aborted The output order of search target document candidates is determined based on the similarity calculated at that time, and the top X are output as search results (step 55).
That is, for the X pieces having low acoustic distance values (similarity) R ′ calculated in the DP matching calculated session, the lyrics data and the music information thereof are output as search results to the output interface 3. The X pieces to be output may be lyric data corresponding to all similarities calculated in at least one session for which DP matching has been calculated and their music information (in this case, X is a multiple of M). , It may be a higher one.

セッション類似度Ｄが所定の閾値αより小さくない場合は、ＤＰマッチング計算済のセッションの番号に「１」を加算した次のセッションをＤＰマッチング計算対象とし（ステップ５６）、ステップ５２からステップ５４の処理が繰り返して行われる。 If the session similarity D is not smaller than the predetermined threshold α, the next session obtained by adding “1” to the number of the session for which DP matching has been calculated is set as the DP matching calculation target (step 56). The process is repeated.

上述の例では、セッション類似度Ｄ＝Ｄ(min)／Ｄ(mean)、閾値α＝０．４〜０．５とすることで、検索文字列と似ている歌詞データの存在の有無を判断したが、セッション類似度Ｄは、セッション内における類似度の最小値と平均値に基づいて算出される値であればよく、また、閾値αは、０．０〜１．０の範囲で所定値を決めればよい。例えばセッション内の標準偏差値をＤ(s)とした場合、以下に示す数式からセッション類似度Ｄを求めても良い。 In the above example, the presence / absence of lyrics data similar to the search character string is determined by setting the session similarity D = D (min) / D (mean) and the threshold α = 0.4 to 0.5. However, the session similarity D may be a value calculated based on the minimum value and the average value of the similarity in the session, and the threshold value α is a predetermined value in the range of 0.0 to 1.0. You can decide. For example, when the standard deviation value in the session is D (s), the session similarity D may be obtained from the following formula.

セッション類似度Ｄ＝Ｄ（s）／（Ｄ(mean)‐Ｄ(min)） Session similarity D = D (s) / (D (mean) −D (min))

上述した第２パスＤＰマッチング検索によれば、セッション内における類似度の最小値と平均値に基づいて算出されるセッション類似度Ｄ（例えば、Ｄ＝Ｄ(min)／Ｄ(mean)）が所定の閾値αより小さい場合に、類似度が小さく検索文字列と似ている歌詞データが存在すると判断し、以降のセッションでのＤＰマッチングによる類似度算出を打ち切るので、ＤＰマッチングを行う歌詞データの数を減少させることで、検索率を低下させることなく検出処理の速度向上を図ることができる。 According to the second path DP matching search described above, the session similarity D (for example, D = D (min) / D (mean)) calculated based on the minimum value and average value of the similarity in the session is predetermined. If the threshold value α is smaller than the threshold α, it is determined that there is lyric data that has a low similarity and is similar to the search character string, and the similarity calculation by DP matching in the subsequent sessions is terminated. By reducing, the detection processing speed can be improved without reducing the search rate.

また、上述したＤＰマッチングによる類似度算出の場合、検索文字列の音素の数と、歌詞データＬ（ｋ）の音素の数が異なるので、上述した検索用INDEXの作成時の音節列クラスタと歌詞データとの類似度の算出と同様に端点フリーＤＰマッチングを利用する。
例えば、入力した検索文字列が「好きな事がない」である場合に、歌詞データを構成する音素列との比較を行い、歌詞データに「好きな事が何」が含まれる場合、検索文字列の「好きな事がない」の音素列「sukinakotoganai」（１５音素）と、歌詞データ（１５音素以上）と端点フリーＤＰマッチングで比較し、歌詞データに含まれる最も類似する「好きな事は何」の音素列「sukinakotowanani」との類似度Ｒ´が算出される。 In addition, in the similarity calculation by DP matching described above, the number of phonemes in the search character string and the number of phonemes in the lyrics data L (k) are different, so the syllable string cluster and the lyrics at the time of creating the above-described search INDEX Similar to the calculation of the similarity to data, the endpoint free DP matching is used.
For example, if the input search string is "There is nothing you like", compare it with the phoneme string that makes up the lyrics data, and if the lyrics data contains "what you like" Compare the phoneme sequence “sukinakotoganai” (15 phonemes) of the “nothing you like” column with lyric data (15 phonemes or more) by end-point free DP matching, and the most similar “like thing” included in the lyric data The similarity R ′ to the “what” phoneme string “sukinakotowanani” is calculated.

上述した類似文書検索装置によれば、検索用INDEX作成手段１２におけるテーブル３０の作成段階において、各音節列クラスタＣ（ｎ）と歌詞データ（検索対象文書）Ｌ（ｋ）との類似度を事前に計算しておくため、検索文字列の検索時における第１パスINDEX検索での歌詞データの絞り込み処理にはテーブル３０の類似度を加算するだけで類似度のＤＰマッチングによる計算が不要となる。したがって、検索文字列と検索対象文書との音素間同士で行われる計算量の多いＤＰマッチングは、少量の検索対象候補に限定して行う第２パスＤＰマッチング検索のみで行われるため、全体の計算量を減少させることで高速検索を実現することができる。 According to the similar document search device described above, the similarity between each syllable string cluster C (n) and the lyric data (search target document) L (k) is determined in advance at the stage of creating the table 30 in the search INDEX creating means 12. Therefore, the calculation of the lyrics data in the first pass INDEX search at the time of search of the search character string only needs to add the similarities in the table 30, and the calculation by the DP matching of the similarities becomes unnecessary. Therefore, since DP matching with a large amount of calculation performed between phonemes of the search character string and the search target document is performed only by the second pass DP matching search performed only for a small amount of search target candidates, the entire calculation is performed. High-speed search can be realized by reducing the amount.

また、検索用INDEX作成手段１２におけるテーブル３０の作成時に、検索文字列のサブワード連鎖となる可能性がある音節N-Gramが含まれる音節列クラスタと歌詞データ（検索対象文書）の類似度を考慮するため、第１パスINDEX検索において正解となる歌詞データ（検索対象文書）を候補から外してしまう確率を低下させることができる。この結果、全検索対象に対するＤＰマッチングによる検索と同程度の検索精度が実現できる。 In addition, when the table 30 is created in the search index creation means 12, the similarity between the syllable string cluster including the syllable N-Gram that may be a subword chain of the search character string and the lyric data (search target document) is considered. Therefore, it is possible to reduce the probability that the lyrics data (search target document) that is the correct answer in the first pass INDEX search is excluded from the candidates. As a result, it is possible to realize the same search accuracy as the search by DP matching for all search targets.

また、検索用INDEX作成手段１２におけるテーブル３０の作成時、及び、検索文字列を入力して類似文書検索を行い際に、単語の代わりに、音節N-Gram（サブワード連鎖）を利用することによって、検索文字列と歌詞データ（検索対象文書）との間で対応する単語が存在しない（OUT-OF-VOCABULRY）という問題が生じることがなくなる。
さらに、検索用INDEX作成手段１２におけるテーブル３０の作成時に、クラスタリングによって検索用INDEXへのエントリ数が（音節N-Gramの数から音節列クラスタの数になったため）減るので、検索用INDEX作成のための音節N-Gramの種類が多い場合でも検索用INDEXサイズを削減することができる。 In addition, by using a syllable N-Gram (subword chain) instead of a word when creating the table 30 in the search INDEX creating means 12 and performing a similar document search by inputting a search character string The problem of no corresponding word (OUT-OF-VOCABULRY) between the search character string and the lyric data (search target document) does not occur.
Further, when the table 30 is created in the search INDEX creation means 12, the number of entries in the search INDEX is reduced by clustering (because the number of syllable N-Grams is reduced to the number of syllable string clusters). Therefore, even when there are many types of syllable N-Grams, the search index size can be reduced.

上述した類似文書検索装置では、類似度演算手段１４において類似度を考慮して検索文字列から歌詞データ（検索対象文書）を検索する場合、第１パスINDEX検索での歌詞データの絞り込み処理と、検索文字列と検索対象文書との音素間同士のＤＰマッチングによる第２パスＤＰマッチング検索処理を行っている。そして、第１パスINDEX検索での歌詞データの絞り込み個数Ｐは変数となっているので、検索精度を高める場合は個数Ｐを多くして第２パスＤＰマッチング検索処理を行い、検索処理時間を短縮する場合には個数Ｐを少なくして第２パスＤＰマッチング検索処理を行うように個数Ｐを設定すればよい。 In the similar document search device described above, when searching for lyric data (search target document) from the search character string in consideration of the similarity in the similarity calculation means 14, the lyric data narrowing-down process in the first pass INDEX search, Second pass DP matching search processing is performed by DP matching between phonemes of the search character string and the search target document. Since the number P of lyrics data in the first pass INDEX search is a variable, if the search accuracy is to be increased, the number P is increased and the second pass DP matching search process is performed to shorten the search processing time. In this case, the number P may be set so that the number P is reduced and the second pass DP matching search process is performed.

１…類似文書検索装置、２…入力インタフェース、３…出力インタフェース、４…歌詞データベース、１０…制御部、１１…音節列クラスタ生成手段、１２…検索用INDEX作成手段、１３…サブワード連鎖抽出手段、１４…類似度演算手段。 DESCRIPTION OF SYMBOLS 1 ... Similar document search apparatus, 2 ... Input interface, 3 ... Output interface, 4 ... Lyrics database, 10 ... Control part, 11 ... Syllable string cluster production | generation means, 12 ... Index creation means for search, 13 ... Subword chain extraction means, 14: Similarity calculation means.

Claims

In a search device comprising an input interface for inputting a character string or a character string converted by voice recognition as a search character string, and searching for a search target document including a portion similar to the search character string from a plurality of search target documents,
Create a syllable string or phoneme string (hereinafter referred to as “subword chain”) formed by combining a predetermined number of subwords from a plurality of syllables or phonemes (hereinafter referred to as “subwords”) in a plurality of documents . , A group to which a plurality of subword chains clustered based on the distance between each subword chain belongs is registered as a syllable string cluster or a phoneme string cluster (hereinafter referred to as “subword chain cluster”) , and a plurality of registered subword chain clusters are registered. And a search index creation means for creating a similarity table corresponding to each of the subword chain clusters in advance as a search index;
Subword chain extraction means for extracting a plurality of syllable strings or phoneme strings connecting the desired numbers from the syllables or phonemes constituting the search character string input to the input interface;
For each subword chain extracted by the subword chain extraction means, each similarity corresponding to a subword chain cluster to which each subword chain belongs is obtained from the search INDEX, and the total of the similarities is calculated as a similarity to each search target document. Similarity calculation means for calculating as
An output interface for outputting a search result of document data based on the similarity,
The similarity calculation means performs a first pass INDEX search for narrowing down a plurality of search target document candidates in descending order of the similarity corresponding to the subword chain cluster to which each of the subword chains found from the search INDEX belongs. When,
A second output order is determined based on a similarity obtained by converting the plurality of search target document candidates into a phoneme string, and calculating a character string similarity with the phoneme string of the search character string as a unit. Perform a search with path string similarity calculation ,
The search by the second path character string similarity calculation is as follows:
The search target document candidates searched in the first pass INDEX search are divided into a plurality of sessions, each grouping M from the top as one group,
Calculate the string similarity for the search target document candidates belonging to each session sequentially in the order of the top candidate session, and calculate the similarity,
When the session similarity D calculated based on the minimum value and the average value of the similarity in the session is smaller than the predetermined threshold value α, the similarity calculation by the character string similarity calculation in the subsequent session is terminated,
A similar document search apparatus characterized in that an output order of search target document candidates is determined based on a similarity calculated at that time.

The session similarity D and the threshold α are set to the minimum value D (min) and the average value D (mean) of the similarity in the session.
Session similarity D = D (min) / D (mean),
Threshold value α = 0.0 to 1.0
The similar document search device according to claim 1.

When the standard deviation value in the session is D (s),
Session similarity D = D (s) / (D (mean) −D (min)),
The similar document search device according to claim 2.

The similarity in the INDEX creation means is obtained by converting a syllable string or a phoneme string representing each subword chain cluster and a search target document into a phoneme string, and performing character string similarity calculation in units of phonemes in the two phoneme strings. The similar document search device according to claim 1, wherein the similar document search device calculates the cumulative character string similarity value obtained in this way.

5. The character string similarity calculation between the phoneme strings is performed by calculating a phoneme replacement cost value, a phoneme insertion cost value, or a phoneme dropout cost value based on a phoneme confusion matrix by a speech recognition experiment. Similar document search device.

The similar document search device according to claim 5, wherein the replacement cost value, insertion cost value, and dropout cost value are values determined by an experience value or a pronunciation dictionary.

The similar document search device according to claim 1, wherein the subword chain cluster is a group to which a plurality of subword chains belong, and is only one subword chain.

The similar document search apparatus according to claim 1, wherein the input interface inputs a part of lyrics as text data.

2. The similar document search device according to claim 1, wherein the input interface inputs a part of lyrics by voice, and searches after converting the input voice into text data by a voice recognition technique.

The similar document search device according to claim 1, wherein the output interface presents a song name, an artist name, or a full text of lyrics as a search result.

The similar document search device according to claim 1, wherein the output interface displays the entire lyrics in a linked manner, and highlights the portion of the lyrics that matches the query when displayed.

The similar document search device according to claim 1, wherein the subword chain is created by a combination of a plurality of subwords .

The similar document search device according to claim 1, wherein the subword chain is composed of one subword .

The similar document search apparatus according to claim 1, wherein the character string similarity calculation uses an end point free DP matching method in consideration of the cost of insertion, deletion, or replacement of phonemes in units of two phoneme strings.

The similar document search device according to claim 1, wherein the character string similarity calculation is performed by a similarity calculation method based on the number of different phonemes in two phoneme strings.