JP3973572B2 - Data analysis apparatus and data analysis program

Info

Publication number: JP3973572B2
Application number: JP2003031078A
Authority: JP
Inventors: 克年大附
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-02-07
Filing date: 2003-02-07
Publication date: 2007-09-12
Anticipated expiration: 2023-02-07
Also published as: JP2004240848A

Description

【０００１】
【発明の属する技術分野】
この発明は、コンピュータシステムによって、コンテンツデータのストーリー構造を分析するデータ分析装置、データ分析方法及びデータ分析プログラムに関し、特に、映像／音声データ等からなるマルチメディアコンテンツデータのストーリー構造を分析するデータ分析装置、データ分析方法及びデータ分析プログラムに関する。
【０００２】
【従来の技術】
従来、映像や音声からなるマルチメディアコンテンツを検索可能な構成とする場合、そのコンテンツを、人手によって、情報のまとまりの単位（以下、「ストーリー構造」という。）に区分し、その区分情報をコンテンツ中に記述することが一般的であった。
しかし、このような人手によってストーリー構造を抽出するには多大な労力が必要であり、大量のコンテンツに対するストーリー構造を生成することは極めて困難であった。
【０００３】
このような中、コンテンツ中の音響信号を音声認識し、抽出した文字列に基づいて、このストーリー構造を自動的に抽出するビデオデータ検索支援方法が提唱されている（特許文献１参照。）。また、この文字列を意味的なまとまり単位に分割する方法としては、例えば、特許文献２に示す方法が開示されている。
【０００４】
【特許文献１】
特開平５−３４２２６３号公報
【特許文献２】
特開２００２−３４２３２４号公報
【０００５】
【発明が解決しようとする課題】
しかし、特許文献１において開示された方法は、コンテンツ中の音響信号を音声認識し、抽出した文字列に基づいてストーリー構造を抽出するため、コンテンツからの音声認識結果が得られず、文字列が抽出できなかった区間では、ストーリー構造が抽出できないという問題点がある。
また、この音声認識結果が得られなかった区間におけるストーリー構造の抽出を人手で行うには、多大な労力を必要とする。
この発明はこのような点に鑑みてなされたものであり、コンテンツから音声認識結果が得られない区間においても、ストーリー構造を容易に抽出することが可能なデータ分析装置を提供することを目的とする。
【０００６】
また、この発明の他の目的は、コンテンツから音声認識結果が得られない区間においても、ストーリー構造を容易に抽出することが可能なデータ分析方法を提供することである。
さらに、この発明の他の目的は、コンテンツから音声認識結果が得られなかった区間であっても、ストーリー構造を容易に抽出することができる機能を、コンピュータに実行させるためのデータ分析プログラムを提供することである。
【０００７】
【課題を解決するための手段】
この発明では上記課題を解決するために、コンテンツデータによって特定される文字列を、パターン認識処理によって抽出し、この抽出された文字列を、意味的にまとまりのある単位に分割し、その分割点の時系列的な位置（話題境界点）を特定する話題境界点データを抽出する。また、このコンテンツデータによって特定されるコンテンツの物理的な変化点の時系列的な位置（信号変化点）を特定する信号変化点データを抽出する。そして、この抽出された話題境界点データと、信号変化点データとを用い、コンテンツの内容的な境界点の時系列的な位置（ストーリー境界点）を特定する。
【０００８】
なお、この発明において、好ましくは、時系列的な幅を持った探索区間に属する話題境界点のスコアと、この探索区間に属する信号変化点のスコアとが、所定の条件を満たした場合に、この探索区間をストーリー境界候補区間とし、このストーリー境界候補区間ごとに抽出した、このストーリー境界候補区間に属する話題境界点或いは信号変化点を、ストーリー境界点とする。
また、この発明において、好ましくは、話題境界点のスコア及び信号変化点のスコアを、第１の重み係数で重み付けを行って合計した値が、所定のしきい値に達した場合に、この話題境界点或いは信号変化点が属する探索区間を、ストーリー境界候補区間とする。
【０００９】
さらに、この発明において、好ましくは、ストーリー境界候補区間ごとに、このストーリー境界候補区間に属する話題境界点及び信号変化点のスコアを、第２の重み係数で重み付けを行って比較し、この重み付け後のスコアが、このストーリー境界候補区中最大であった話題境界点或いは信号変化点を、ストーリー境界点とする。
また、この発明において、好ましくは、上述の第２の重み係数は、上述の第１の重み係数と異なる係数である。
さらに、この発明において、好ましくは、ストーリー境界候補区間に属する話題境界点及び信号変化点から、最も優先順位の高い話題境界点或いは信号変化点を、ストーリー境界点とする。
【００１０】
また、この発明において、好ましくは、話題境界点のスコアは、パターン認識処理時におけるスコアを用いて生成されたスコアである。
さらに、この発明において、好ましくは、話題境界点のスコアは、話題境界点データを抽出する際のスコアを用いて生成されたスコアである。
また、この発明において、好ましくは、信号変化点のスコアは、信号変化点データを抽出する際のスコアを用いて生成されたスコアである。
【００１１】
【発明の実施の形態】
以下、この発明の第１の実施の形態を、図面を参照して説明する。なお、以下では、まずこの形態の概略について説明した後、その詳細について説明を行っていく。
図１は、この形態におけるデータ分析装置１の概略を例示した概念図であり、図２は、そのデータ分析方法を説明するための概念図である。以下、この図１及び図２を用いて、この例のデータ分析装置１の構成、及びそのデータ分析方法の概略について説明を行っていく。
【００１２】
データ分析装置１は、例えば、パーソナルコンピュータに所定のプログラムを実行させることによって構成され、そのコンテンツデータ記憶手段２には、映像・音声からなる時系列（図２では、ｔ＝０〜ｎ）をもったコンテンツデータ１０が記憶されている。この例のデータ分析装置１によってデータ分析を行う場合、まず、文字列抽出手段３によってコンテンツデータ記憶手段２からコンテンツデータ１０を読み出し、これによって特定される文字列を、パターン認識処理によって抽出する。ここで、「パターン認識処理」とは、音声や映像をパターン認識し、文字列を抽出する処理をいう（例えば、音声認識処理や文字認識処理等）。抽出された文字列の情報は、話題境界点抽出手段４に送られ、そこで、この抽出された文字列を、意味的にまとまりのある単位に分割し、その分割点の時系列的な位置（話題境界点１１ａ〜１１ｅ）を特定する話題境界点データ１１を抽出する。
【００１３】
また、この例では、信号変化点抽出手段５によってコンテンツデータ記憶手段２からコンテンツデータ１０を読み出し、このコンテンツデータ１０によって特定されるコンテンツの物理的な変化点の時系列的な位置（音響信号変化点１２ａ〜１２ｉや映像信号変化点１３ａ〜１３ｉ）を特定する信号変化点データ（音響信号変化点データ１２や映像信号変化点データ１３）を抽出する。
そして、これらの抽出された話題境界点データ１１、及び信号変化点データ（音響信号変化点データ１２や映像信号変化点データ１３）は、ストーリー境界点抽出手段６に送られ、そこで、この話題境界点データ１１と、信号変化点データ（音響信号変化点データ１２や映像信号変化点データ１３）を用い、コンテンツの内容的な境界点の時系列的な位置（ストーリー境界点１５ａ〜１５ｄ）が特定される。
【００１４】
なお、詳細は後述するが、この例の場合、ストーリー境界点１５ａ〜１５ｄの特定は以下のように行う。
まず、ある時刻ｔにおける探索区間（t〜t+d）に属するスコアが所定の条件を満たした場合、すなわち、時系列的な幅（d）を持った探索区間１６を時系列に沿って移動させ、この探索区間１６に属する話題境界点１１ａ〜１１ｅのスコアと、この探索区間１６に属する信号変化点（音響信号変化点１２ａ〜１２ｉ、及び映像信号変化点１３ａ〜１３ｉ）のスコアとが、所定の条件を満たした場合に、この探索区間１６をストーリー境界候補区間１４ａ〜１４ｄとする（ストーリー境界候補区間データ１４）。そして、このストーリー境界候補区間１４ａ〜１４ｄごとに抽出した、このストーリー境界候補区間１４ａ〜１４ｄに属する話題境界点或いは信号変化点を、ストーリー境界点１５ａ〜１５ｄとする（ストーリー境界点データ１５）。
【００１５】
以上のように、この形態の例では、音声認識結果と、音響信号や映像信号の変化点の抽出結果とを用いてストーリー境界点１５ａ〜１５ｄを抽出することとしたため、コンテンツから音声認識結果が得られなかった区間であっても、容易にストーリー構造を抽出することができる。
次に、この形態の詳細について説明する。
図３は、この形態におけるデータ分析装置２０のハードウェア構成を例示したブロック図である。
図３に例示するように、データ分析装置２０は、コンテンツデータを記憶するコンテンツデータ記憶部２１、ＣＰＵ（Central Processing Unit）２２、キーボードやマウス等の入力部２３、ストーリー境界点データを記憶するストーリー境界点データ記憶部２４、各種データを記憶する記憶部２５、データ分析プログラムを記憶するデータ分析プログラムメモリ２６、液晶ディスプレイ等の表示部２７、及びこれらをデータのやり取り可能に接続するバス２８を有している。
【００１６】
図４は、図３に例示したハードウェア構成において、データ分析プログラムメモリ２６に記憶されたデータ分析プログラムを実行することによって構築されるデータ分析装置２０の処理機能を例示したブロック図である。また、図５は、記憶部２０ｅに記憶されるデータの構成を例示し、図６は、重み係数記憶部２０ｉに記憶されたデータの構成を例示している。さらに、図７は、データ分析装置２０によって行われるデータ分析方法を説明するためのフローチャートである。
以下、これらの図４〜図７を用い、この形態におけるデータ分析装置２０の機能構成及びそのデータ分析方法について説明を行っていく。なお、以下の説明における音声ポーズデータ（VOP）、音声認識結果データ（VOR）、話題境界点データ（VORB）、音響信号変化点データ（BGC）、音響ポーズデータ（BGP）、映像信号変化点データ（VIC）は、１つの音声データ（VO）、音響データ（BG）、或いは画像データ（VI）に対し、それぞれ１つ或いは複数抽出され、処理されるものとし、それぞれのデータを区別する添え字の記載は省略する(第２及び第３の実施の形態についても同様)。
【００１７】
まず、コンテンツデータ記憶部２０ａから、音声データ（VO）を抽出し（ステップＳ１）、この音声データ（VO）を音声認識部２０ｃに送る。次に、この音声認識部２０ｃにおいて、この音声データ（VO）の音声ポーズデータ（VOP）１００を抽出し（ステップＳ２）、さらに、音声認識処理により、音声データ（VO）によって特定される音声の文字列やその音声認識スコア（音声認識結果データ（VOR））を抽出する（ステップＳ３）。
ここで、「ポーズ」とは、音声開始・終了時等の無音状態（例えば、音情報の振幅の自乗和が所定のしきい値以下である状態）が所定時間以上（この例では０．５秒以上）継続する状態をいい、この例の音声ポーズデータ（VOP）１００は、音声データ（VO）によって特定される音声のポーズ部分の時刻（VOP1）１０１、及びその継続時間（msec）であるポーズ長（VOP2）１０２からなるデータである。なお、「時刻」とは、時系列を有するコンテンツの時系列的な位置をいい、コンテンツの先頭からみた再生時間のみではなく、コンテンツの時系列的な位置を示すカウント数等も含む概念である。
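The pause definition above (a silent state, judged by the sum of squared amplitudes, that lasts at least 0.5 seconds) can be sketched as follows. This is only an illustration under stated assumptions: the 10 ms frame length, the energy threshold value, and the name `find_pauses` are hypothetical and do not come from the patent.

```python
from typing import List, Sequence, Tuple


def find_pauses(samples: Sequence[float], sample_rate: int,
                frame_ms: int = 10, energy_threshold: float = 1e-4,
                min_pause_ms: int = 500) -> List[Tuple[float, int]]:
    """Return (start time in seconds, pause length in ms) pairs, i.e. (VOP1, VOP2)."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    pauses: List[Tuple[float, int]] = []
    run_start = None  # index of the first frame of the current silent run
    # One extra iteration (i == n_frames) closes a run that reaches the end.
    for i in range(n_frames + 1):
        silent = False
        if i < n_frames:
            frame = samples[i * frame_len:(i + 1) * frame_len]
            # "Silent" means the sum of squared amplitudes is at or below the threshold.
            silent = sum(x * x for x in frame) <= energy_threshold
        if silent and run_start is None:
            run_start = i
        elif not silent and run_start is not None:
            length_ms = (i - run_start) * frame_ms
            if length_ms >= min_pause_ms:  # keep only runs of 0.5 s or longer
                pauses.append((run_start * frame_ms / 1000.0, length_ms))
            run_start = None
    return pauses
```

With a 1 kHz toy signal that contains a 0.6 s silence and a 0.2 s silence, only the longer one is reported as a pause.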
【００１８】
また、この例の「音声認識スコア」は、音声認識時における音響モデルとの一致を示す音響スコア、抽出した文字列の文章的なつながりを示す言語モデルとの一致を示す言語スコア、音声認識結果の信頼性（例えば、対立候補の数と、選択された単語と対立候補のスコア差とに基づく値）を示す信頼性スコアによって構成され、この「音声認識スコア」の値が大きいほど、その認識結果が正確であることを示している。
そして、この例の場合、音声認識部２０ｃで抽出された音声ポーズデータ（VOP）は、記憶部２０ｅに送られ、そこで記憶される他（図５）、音声認識結果データ（VOR）とともに話題境界点抽出部２０ｄに送られる。
【００１９】
これらの情報が送られた話題境界点抽出部２０ｄは、これらの情報をもとに、話題境界点データ（VORB）１１０を抽出し（ステップＳ４）、抽出した話題境界点データ（VORB）１１０を記憶部２０ｅに送って、記憶部２０ｅに記録する（図５）。
なお、この例の話題境界点データ（VORB）１１０は、音声認識結果データ（VOR）の文字列を、意味的にまとまりのある単位に分割した場合の分割点の時刻（VORB1）１１１、及び文字列を意味的にまとまりのある単位に分割する際に付される境界スコア（VORB2）１１２（話題境界点のスコア）からなるデータである。
【００２０】
また、この話題境界点データ（VORB）１１０の抽出は、例えば、音声ポーズデータ（VOP）１００の時刻（VOP1）１０１によって仕切られる「コンテンツの文章単位」を、分割される最小単位とし、これに特開２００２−３４２３２４号公報に記載された方法を用いて行う。すなわち、まず、単語の意味を表現する単語ベクトルが格納されている概念ベースを検索することによって、音声認識結果データ（VOR）の文字列が有する各単語に対応する単語ベクトルを取得する。なお、この単語ベクトルは、意味的に近似している単語間ほど距離が近く、意味的に類似していない単語間ほど距離が遠くなるような値が設定されたベクトルである。次に、音声ポーズデータ（VOP）１００の時刻（VOP1）１０１で仕切られる単語の境界（単語境界）の前後に、所定個数の単語の集合である単語列をとり、各単語列を構成する単語の単語ベクトルの情報から単語列結束度（前後の単語列の類似尺度、或いは距離尺度）を算出する。すなわち、例えば、これらの単語ベクトルの和、又は重心により、前後の単語列の類似尺度、又は単語列結束度を求める。そして、この単語列結束度が類似尺度である場合、極小の単語境界を、距離尺度である場合、極大の単語境界を、話題境界点と認定する。
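The word-sequence cohesion step described above can be sketched roughly as follows. The sketch assumes a similarity-type cohesion (cosine similarity between the summed word vectors before and after each candidate boundary) and reports local minima as topic boundaries. The toy concept base, the window size, and all function names are hypothetical; the actual method is the one in JP 2002-342324 A, which this sketch does not reproduce in detail.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]


def _sum_vectors(words: Sequence[str], concept_base: Dict[str, Vector]) -> List[float]:
    """Sum the word vectors of a word sequence (unknown words count as zero)."""
    dim = len(next(iter(concept_base.values())))
    total = [0.0] * dim
    for word in words:
        for j, value in enumerate(concept_base.get(word, (0.0,) * dim)):
            total[j] += value
    return total


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def cohesion(words: Sequence[str], boundary: int, window: int,
             concept_base: Dict[str, Vector]) -> float:
    """Similarity between the `window` words before and after a word boundary."""
    before = words[max(0, boundary - window):boundary]
    after = words[boundary:boundary + window]
    return _cosine(_sum_vectors(before, concept_base),
                   _sum_vectors(after, concept_base))


def topic_boundaries(words: Sequence[str], candidates: Sequence[int], window: int,
                     concept_base: Dict[str, Vector]) -> List[Tuple[int, float]]:
    """Return (boundary index, cohesion) for candidates that are local minima."""
    values = [cohesion(words, b, window, concept_base) for b in candidates]
    found = []
    for i, b in enumerate(candidates):
        left = values[i - 1] if i > 0 else float("inf")
        right = values[i + 1] if i + 1 < len(values) else float("inf")
        if values[i] < left and values[i] < right:
            found.append((b, values[i]))
    return found
```

Here `candidates` stands for the word boundaries delimited by the speech pause times (VOP1); with a distance-type cohesion the comparison would be reversed and local maxima would be taken instead.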
【００２１】
さらに、この例の場合、境界スコア（VORB2）１１２は、話題境界点データ（VORB）１１０を抽出する際のスコア（単語列結束度）を用いて生成されたスコアであり、単語列結束度が類似尺度の場合、単語列結束度が大きければ大きいほど大きな値をとり、単語列結束度が距離尺度の場合、単語列結束度が大きければ大きいほど小さな値をとるスコアである。
また、音声認識時の「音声認識スコア」の値が所定値以下の単語を、話題境界点データ（VORB）１１０抽出時における評価対象から除外する、或いは、話題境界点データ（VORB）１１０抽出時に「音声認識スコア」に応じた単語に重み付けを行うこと等により、境界スコア（VORB2）１１２に音声認識スコアを反映させることとしてもよい（この場合、境界スコア（VORB2）１１２は、パターン認識処理時におけるスコアを用いて生成されたスコアということができる）。
【００２２】
次に、この例では、コンテンツデータ記憶部２０ａから、音響データ（BG）を抽出し（ステップＳ５）、この音響データ（BG）を音響信号変化点抽出部２０ｆに送る。そして、この音響信号変化点抽出部２０ｆにおいて、音響信号変化点データ（BGC）１２０と音響ポーズデータ（BGP）１３０とを抽出し（ステップＳ６）、これらを記憶部２０ｅに送った後、記憶部２０ｅで記憶させる（図５）。ここで、音響信号変化点データ（BGC）１２０は、音響データ（BG）が有する音響信号の物理的な変化点の時刻（BGC1）１２１、及びこの変化点を抽出する際に付された変化スコア（BGC2）１２２からなるデータであり、音響ポーズデータ（BGP）１３０は、音響データ（BG）によって特定される音のポーズ部分（音声・音楽の開始・終了時等）の時刻（BGP1）１３１、及びその継続時間（msec）であるポーズ長（BGP2）１３２からなるデータである。
【００２３】
なお、この音響信号の物理的な変化点の時刻（BGC1）の抽出は、例えば、以下のような方法を用いる。すなわち、まず、音声区間、音楽区間それぞれのデータを用い、音声区間を表すモデル及び音楽区間を表すモデルを学習する。そして、それらのモデルを適用し、入力された信号にスコア付け（それぞれのモデルとの一致度を示すスコア付け）を行い、音声区間モデルと一致した場合に付されるスコア（音声区間スコア）と、音楽区間モデルと一致した場合に付されるスコア（音楽区間スコア）との大小関係が逆転した点を変化点とする。この際、変化スコア（BGC2）は、例えば、時刻（BGC1）における「音声区間スコア」と「音楽区間スコア」との差を用いて生成され、この差が大きいほど変化スコア（BGC2）の値が大きくなるように生成される。
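As a rough illustration of the rule above, the following sketch takes per-frame speech-section and music-section scores as given and reports a change point wherever their order reverses, using the absolute score difference at that time as the change score (BGC2). How the two models are trained is outside the sketch, and the function name and input layout are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def acoustic_change_points(times: Sequence[float],
                           speech_scores: Sequence[float],
                           music_scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (BGC1, BGC2) pairs: the time of each reversal and its change score."""
    points: List[Tuple[float, float]] = []
    prev_diff = None  # last non-zero (speech score - music score)
    for t, speech, music in zip(times, speech_scores, music_scores):
        diff = speech - music
        # A sign flip means the larger of the two scores has changed.
        if prev_diff is not None and diff * prev_diff < 0:
            points.append((t, abs(diff)))  # larger difference -> larger change score
        if diff != 0:
            prev_diff = diff
    return points
```

Ties (equal scores) are skipped so that a reversal is only reported once one score clearly exceeds the other; that tie handling is a design choice of this sketch, not something the text specifies.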
【００２４】
また、音声区間を表すモデルの学習は、例えば、以下の方法を用いる。すなわち、音声が存在する場合には、周波数方向に整数倍あるいはそれに近い帯状のスペクトルが観測できる。そのため、周波数方向に適当な間隔のくし形フィルターを用意し、くしの間隔を変化させ、又は周波数方向に移動させながら、くしの頂点でのスペクトルパワーの総和を求める。そして、音声が存在し、ハーモニクスが存在する場合には、スペクトルパワーの総和が大きくなるため、このスペクトルパワーの総和が所定のしきい値を超えた領域を、音声区間を表すモデルとして学習する（例えば、特開平８−１７９７９１号公報参照）。
【００２５】
また、音楽区間を表すモデルの学習は、例えば、以下の方法を用いる。すなわち、音情報の周波数スペクトルを算出し、そのケプストラムを求め、さらに、そのケプストラムの周波数方向の変動がない軌跡の平均持続時間を算出し、この平均持続時間が所定のしきい値を超える領域を、音楽区間を表すモデルとして学習する（例えば、特開平８−１７９７９１号公報参照）。
また、音響ポーズデータ（BGP）１３０については、例えば、音情報の振幅の自乗和が所定のしきい値以下である状態の発生時刻を時刻（BGP1）１３１とし、その継続時間をポーズ長（BGP2）１３２とする。
【００２６】
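Next, in this example, video data (VI) is extracted from the content data storage unit 20a (step S7) and sent to the video signal change point extraction unit 20g. The video signal change point extraction unit 20g extracts video signal change point data (VIC) 140 (step S8), sends it to the storage unit 20e, and has the storage unit 20e record it (FIG. 5). Here, the video signal change point data (VIC) 140 consists of the time (VIC1) 141 of a physical change point of the video signal contained in the video data (VI), and the change score (VIC2) 142 assigned when that change point is extracted.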
【００２７】
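The time (VIC1) 141 of a physical change point of the video signal contained in the video data (VI) can be extracted, for example, as follows. For two temporally adjacent images I_t and I_t-1 in the captured sequence of video data (VI), the difference between the luminance values of corresponding pixels is computed, and if the sum of the absolute differences over the whole frame (the inter-frame difference D(t)) exceeds a predetermined threshold, t is taken as the change point time (VIC1) 141 (see 大辻、外村、大庭：「輝度情報を使った動画像ブラウジング」、電子情報通信学会技術報告、IE90-103, 1991). In this case, the change score (VIC2) 142 is generated using, for example, the inter-frame difference D(t) (which corresponds to the score obtained when extracting the signal change point data).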
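The inter-frame difference D(t) described above, the sum over all pixels of |I_t(x, y) - I_t-1(x, y)|, can be sketched as follows. Frames are represented here as nested lists of luminance values purely for illustration; the threshold value and the function names are assumptions, not values from the patent.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[Sequence[int]]  # rows of luminance values


def frame_difference(current: Frame, previous: Frame) -> int:
    """D(t): sum over all pixels of the absolute luminance difference."""
    return sum(abs(a - b)
               for row_cur, row_prev in zip(current, previous)
               for a, b in zip(row_cur, row_prev))


def video_change_points(frames: Sequence[Frame], threshold: int) -> List[Tuple[int, int]]:
    """Return (VIC1, VIC2) pairs: frame index t and D(t) wherever D(t) exceeds the threshold."""
    points = []
    for t in range(1, len(frames)):
        d = frame_difference(frames[t], frames[t - 1])
        if d > threshold:
            points.append((t, d))  # D(t) itself serves as the change score
    return points
```

For example, a sequence of dark, dark, bright, bright, dark 2x2 frames yields change points at the two cuts (t = 2 and t = 4).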
【００２８】
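Instead of the inter-frame difference, the pixel change area, luminance histogram difference, block-wise color correlation, χ² test statistic, or the like may be used as D(t), and t may be taken as the change point time (VIC1) 141 when this D(t) exceeds a predetermined threshold (see 大辻、外村：「映像カット自動検出方式の検討」、テレビジョン学会技術報告、Vol.16, No.43, pp.7-12). In this case, the change score (VIC2) 142 is generated using, for example, the pixel change area, luminance histogram difference, block-wise color correlation, or χ² test statistic (which corresponds to the score obtained when extracting the signal change point data).
Alternatively, rather than thresholding D(t) directly, the change point time (VIC1) 141 may be obtained by thresholding the result of applying various temporal filters to D(t) (see K. Otsuji and Y. Tonomura, "Projection Detecting Filter for Video Cut Detection," Proc. of ACM Multimedia 93, 1993, pp. 251-257).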
【００２９】
上述のように記憶部２０ｅに記憶された音声ポーズデータ（VOP）１００、話題境界点データ（VORB）１１０、音響信号変化点データ（BGC）１２０、音響ポーズデータ（BGP）１３０、及び映像信号変化点データ（VIC）１４０は、ストーリー境界候補区間抽出部２０ｈに送られる。ストーリー境界候補区間抽出部２０ｈは、これらの情報（VOP,VORB, BGC, BGP, VIC）と、重み係数記憶部２０ｉから読み出した重み係数（W）１６２とを用い、ストーリー境界候補区間データ（STS）１５０を抽出する（ステップＳ９）。なお、このステップＳ９の処理の詳細については後述する。また、図６に例示するように、この重み係数（W）１６２は、話題境界点データ（VORB）等の各イベント１６１に割り当てられた係数であり、この例の場合、「話題境界点データ（VORB）」に対して「０．５」が、「音響信号変化点データ（BGC）」に対して「０．２」が、「映像信号変化点データ（VIC）」に対して「０．０５」が、「音声ポーズデータ（VOP）」に対して「０．２」が、「音響ポーズデータ（BGP）」に対して「０．０５」が、それぞれ割り当てられている。
【００３０】
ストーリー境界候補区間データ（STS）１５０は、その区間の開始時刻である始点時刻（STS1）１５１、及び区間の終了時刻である終点時刻（STS2）１５２によって構成され、このようにストーリー境界候補区間抽出部２０ｈで生成されたストーリー境界候補区間データ（STS）１５０は、記憶部２０ｅに送られ、そこで記録される（図５）。
次に、ストーリー境界点抽出部２０ｊにおいて、記憶部２０ｅから、音声ポーズデータ（VOP）１００、話題境界点データ（VORB）１１０、音響信号変化点データ（BGC）１２０、音響ポーズデータ（BGP）１３０、及び映像信号変化点データ（VIC）１４０、ストーリー境界候補区間データ（STS）１５０を読み出し、また、重み係数記憶部２０ｉから重み係数（W）１６２を読み出し、これらのデータを用いてストーリー境界点データ（STB）を抽出する（ステップＳ１０）。ストーリー境界点抽出部２０ｊで抽出されたストーリー境界点データ（STB）は、ストーリー境界点データ記憶部２０ｋに送られ、そこで記録される。
【００３１】
次に、図７におけるステップＳ９の処理の詳細について説明する。
図８は、図７におけるステップＳ９の処理の詳細を説明するためのフローチャートである。以下、このフローチャートに沿って、ストーリー境界候補区間抽出部２０ｈで行われるステップＳ９の処理の詳細を説明する。
この処理では、探索区間（この例では、時刻ＳＴ１から２秒の時間幅を持つ区間）を１秒ステップで０からｎまで移動させ、それぞれのステップの探索区間１６に属する音声ポーズデータ（VOP）１００、話題境界点データ（VORB）１１０、音響信号変化点データ（BGC）１２０、音響ポーズデータ（BGP）１３０、及び映像信号変化点データ（VIC）１４０の各スコアを、重み係数記憶部２０ｉの重み係数（W）で重み付けして加算し、その合計スコア（TSC）がしきい値に達した探索区間をストーリー境界候補区間データ（STS）として抽出する。
【００３２】
具体的には、まず、ST1とTSCとｑ（添え字）を０にリセットし（ステップＳ１１）、ST1≦VORB1≦ST1+2を満たすか否かを判断する（ステップＳ１２）。すなわち探索区間に話題境界点が存在するか否かを判断する。ここで、ST1≦VORB1≦ST1+2を満たした場合、これを満たす全ての話題境界点データ（VORB）に対し、TSC←TSC+VORB2＊0.5の演算を行って（ステップＳ１３）ステップＳ１４へ進み、満たさなかった場合、この演算を行うことなくステップＳ１４に進む。なお、ここで乗算される「0.5」は、前述のように、話題境界点データ（VORB）に割り当てられた重み係数（W）である。
【００３３】
次に、ST1≦BGC1≦ST1+2を満たすか否かを判断する（ステップＳ１４）。ここで、ST1≦BGC1≦ST1+2を満たした場合、これを満たす全ての音響信号変化点データ（BGC）に対し、TSC←TSC+BGC2＊0.2の演算を行って（ステップＳ１５）ステップＳ１６へ進み、満たさなかった場合、この演算を行うことなくステップＳ１６に進む。なお、ここで乗算される「0.2」は、前述のように、音響信号変化点データ（BGC）に割り当てられた重み係数（W）である。
次に、ST1≦VIC1≦ST1+2を満たすか否かを判断する（ステップＳ１６）。ここで、ST1≦VIC1≦ST1+2を満たした場合、これを満たす全ての映像信号変化点データ（VIC）に対し、TSC←TSC+VIC2＊0.05の演算を行って（ステップＳ１７）ステップＳ１８へ進み、満たさなかった場合、この演算を行うことなくステップＳ１８に進む。なお、ここで乗算される「0.05」は、前述のように、映像信号変化点データ（VIC）に割り当てられた重み係数（W）である。
【００３４】
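Next, it is determined whether ST1≦VOP1≦ST1+2 is satisfied (step S18). If it is, the operation TSC←TSC+(VOP2/3000)*0.2 is performed for every item of speech pause data (VOP) that satisfies it (step S19), and the process proceeds to step S20; if not, the process proceeds to step S20 without performing this operation. The multiplier "0.2" is, as described above, the weighting coefficient (W) assigned to the speech pause data (VOP).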
次に、ST1≦BGP1≦ST1+2を満たすか否かを判断する（ステップＳ２０）。ここで、ST1≦BGP1≦ST1+2を満たした場合、これを満たす全ての音響ポーズデータ（BGP）に対し、TSC←TSC+(BGP2/1000)＊0.05の演算を行って（ステップＳ２１）ステップＳ２２へ進み、満たさなかった場合、この演算を行うことなくステップＳ２２に進む。なお、ここで乗算される「0.05」は、前述のように、音響ポーズデータ（BGP）に割り当てられた重み係数（W）である。
【００３５】
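It is then determined whether the total score TSC computed as above is at least the threshold 0.3 (step S22). If TSC is 0.3 or more, ST1 is assigned to STS1_q and ST1+2 to STS2_q (step S23), the result (STS1_q, STS2_q) is stored in the storage unit 20e as story boundary candidate section data (STS) 150 (step S24), and the process proceeds to step S25. The subscript q is incremented (q←q+1) each time step S24 is performed. If TSC is less than 0.3, the process proceeds directly to step S25 without performing these operations or recording.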
【００３６】
このステップＳ２５では、ST1+2≧nであるか否か（すなわち、探索区間の終了点ST2がコンテンツの終了時刻ｎに達したか否か）を判断し、ST1+2≧nであれば処理を終了し、ST1+2≧nでなければ、ST1に１を加算し、合計スコアTSCを０にリセットして（ステップＳ２６）、ステップＳ１２に戻る。
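The loop of steps S11 to S26 can be sketched as follows. The window width (2 s), step (1 s), threshold (0.3), and pause normalizations (VOP2/3000, BGP2/1000) follow the first embodiment, and the weights follow the values listed for FIG. 6. The `Event` class, its field names, and the function names are hypothetical conveniences, not part of the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

# Weighting coefficients (W) as listed for FIG. 6.
W: Dict[str, float] = {"VORB": 0.5, "BGC": 0.2, "VIC": 0.05, "VOP": 0.2, "BGP": 0.05}
# Pause lengths (msec) are divided by these constants before weighting.
PAUSE_NORM_MS: Dict[str, float] = {"VOP": 3000.0, "BGP": 1000.0}


@dataclass(frozen=True)
class Event:
    kind: str     # "VORB", "BGC", "VIC", "VOP" or "BGP"
    time: float   # VORB1, BGC1, VIC1, VOP1 or BGP1 (seconds)
    value: float  # VORB2, BGC2, VIC2, or pause length VOP2/BGP2 in msec


def event_score(event: Event) -> float:
    """Score before weighting: pause lengths are normalized, other scores are used as is."""
    norm = PAUSE_NORM_MS.get(event.kind)
    return event.value / norm if norm else event.value


def candidate_sections(events: Sequence[Event], n: float, width: float = 2,
                       step: float = 1, threshold: float = 0.3,
                       weights: Dict[str, float] = W) -> List[Tuple[float, float]]:
    """Slide the search section over 0..n and keep those whose total score TSC reaches the threshold."""
    sections = []
    st1 = 0
    while True:
        tsc = sum(weights[e.kind] * event_score(e)
                  for e in events if st1 <= e.time <= st1 + width)
        if tsc >= threshold:
            sections.append((st1, st1 + width))  # (STS1_q, STS2_q)
        if st1 + width >= n:                     # step S25: end of content reached
            break
        st1 += step                              # step S26: TSC is recomputed from 0
    return sections
```

Because the window is wider than the step, a strong event is typically covered by two consecutive candidate sections; the sketch keeps both, as the flowchart does.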
次に、図７におけるステップＳ１０の処理の詳細について説明する。
図９は、図７におけるステップＳ１０の詳細を説明するためのフローチャートである。以下、このフローチャートに沿って、ストーリー境界点抽出部２０ｊで行われるステップＳ１０の処理の詳細を説明する。
【００３７】
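In this process, for each story boundary candidate section (the section from STS1 to STS2), the scores (VOP2, VORB2, BGC2, BGP2, VIC2) of the speech pause data (VOP) 100, topic boundary point data (VORB) 110, acoustic signal change point data (BGC) 120, acoustic pause data (BGP) 130, and video signal change point data (VIC) 140 belonging to that section are weighted with the weighting coefficients (W) in the weighting coefficient storage unit 20i and compared, and the time (VOP1, VORB1, BGC1, BGP1, VIC1) of the item whose weighted score is the largest in that story boundary candidate section is taken as the story boundary point. The process illustrated in FIG. 9 is performed separately for each story boundary candidate section, so the subscript q of STS1 and STS2 is omitted below.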
【００３８】
この例の場合、まず、SC₁〜ＳＣ_m、及びｋの値を０にリセットする（ステップＳ３１）。ここで、「SC₁〜ＳＣ_m」とは、話題境界点データ（VORB）１１０等の各イベントに対応する重み付け後のスコアを代入する変数であり、ｋは、「SC₁〜ＳＣ_m」の添え字を示す整数であり、mは十分大きな整数である。
次に、STS1≦VORB1≦STS2を満たすか否かを判断する（ステップＳ３２）。すなわちストーリー境界候補区間に話題境界点が存在するか否かを判断する。ここで、STS1≦VORB1≦STS2を満たした場合、これを満たす全ての話題境界点データ（VORB）に対し、SC_k←VORB2＊0.5の演算を行い、その演算のたびにｋ←k+1の演算を行う（ステップＳ3３）。なお、ここで乗算される「0.5」は、前述のように割り当てられた重み係数（W）である。この演算結果SC_kは、それに対応する話題境界点データ（VORB）に関連付けて記憶部２０ｅに記憶され、ステップＳ３４に進む。一方、STS1≦VORB1≦STS2を満たさなかった場合、この演算・記録を行わずステップＳ３４に進む。
【００３９】
次に、STS1≦BGC1≦STS2を満たすか否かを判断する（ステップＳ３４）。ここで、STS1≦BGC1≦STS2を満たした場合、これを満たす全ての音響信号変化点データ（BGC）に対し、SC_k←BGC2＊0.2の演算を行い、その演算のたびにｋ←k+1の演算を行う（ステップＳ３５）。なお、ここで乗算される「0.2」は、前述のように割り当てられた重み係数（W）である。この演算結果SC_kは、それに対応する音響信号変化点データ（BGC）に関連付けて記憶部２０ｅに記憶され、ステップＳ３６に進む。一方、STS1≦BGC1≦STS2を満たさなかった場合、この演算・記録を行わずステップＳ３６に進む。
【００４０】
次に、STS1≦VIC1≦STS2を満たすか否かを判断する（ステップＳ３６）。ここで、STS1≦VIC1≦STS2を満たした場合、これを満たす全ての映像信号変化点データ（VIC）に対し、SC_k←VIC2＊0.05の演算を行い、その演算のたびにｋ←k+1の演算を行う（ステップＳ３７）。なお、ここで乗算される「0.05」は、前述のように割り当てられた重み係数（W）である。この演算結果SC_kは、それに対応する映像信号変化点データ（VIC）に関連付けて記憶部２０ｅに記憶され、ステップＳ３８に進む。一方、STS1≦VIC1≦STS2を満たさなかった場合、この演算・記録を行わずステップＳ３８に進む。
【００４１】
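Next, it is determined whether STS1≦VOP1≦STS2 is satisfied (step S38). If it is, the operation SC_k←(VOP2/3000)*0.2 is performed for every item of speech pause data (VOP) that satisfies it, and k←k+1 is performed after each such operation (step S39). The multiplier "0.2" is the weighting coefficient (W) assigned as described above. Each result SC_k is stored in the storage unit 20e in association with the corresponding speech pause data (VOP), and the process proceeds to step S40. If STS1≦VOP1≦STS2 is not satisfied, the process proceeds to step S40 without performing this operation or recording.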
【００４２】
次に、STS1≦BGP1≦STS2を満たすか否かを判断する（ステップＳ４０）。ここで、STS1≦BGP1≦STS2を満たした場合、これを満たす全ての音響ポーズデータ（BGP）に対し、SC_k←(BGP2/1000)＊0.05の演算を行い、その演算のたびにｋ←k+1の演算を行う（ステップＳ４１）。なお、ここで乗算される「0.05」は、前述のように割り当てられた重み係数（W）である。この演算結果SC_kは、それに対応する音響ポーズデータ（BGP）に関連付けて記憶部２０ｅに記憶され、ステップＳ４２に進む。一方、STS1≦BGP1≦STS2を満たさなかった場合、この演算・記録を行わずステップＳ４２に進む。
【００４３】
その後、記憶部２０ｅに記憶した各スコアSC_p（0≦p≦k）を読み出し、読み出したSC_pから、並び替えアルゴリズム等によって、その最大値を検出する。そして、その最大のSC_pに対応する（関連付けられている）イベントの時刻（VORB1,BGC1,VIC1,VOP1,BGP1）を、ストーリー境界点データ（STB）とし、その値をストーリー境界点データ記憶部２０ｋに記録する（ステップＳ４２）。
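Steps S31 to S42 amount to picking, within one candidate section, the event whose weighted score is largest. A minimal sketch, assuming events are given as (kind, time, value) tuples and using the weights listed for FIG. 6; the tuple layout and function names are illustrative assumptions, and a plain `max` replaces the sorting algorithm mentioned in the text.

```python
from typing import Dict, Optional, Sequence, Tuple

EventTuple = Tuple[str, float, float]  # (kind, time, score or pause length in msec)

# Weighting coefficients (W) as listed for FIG. 6.
W: Dict[str, float] = {"VORB": 0.5, "BGC": 0.2, "VIC": 0.05, "VOP": 0.2, "BGP": 0.05}
PAUSE_NORM_MS: Dict[str, float] = {"VOP": 3000.0, "BGP": 1000.0}


def weighted_score(kind: str, value: float, weights: Dict[str, float]) -> float:
    """SC_k: the event's score (pause lengths normalized) times its weight."""
    return weights[kind] * value / PAUSE_NORM_MS.get(kind, 1.0)


def story_boundary_point(events: Sequence[EventTuple],
                         section: Tuple[float, float],
                         weights: Dict[str, float] = W) -> Optional[float]:
    """Return the time of the event with the largest weighted score in the section."""
    start, end = section
    inside = [(weighted_score(kind, value, weights), time)
              for kind, time, value in events if start <= time <= end]
    if not inside:
        return None
    return max(inside, key=lambda item: item[0])[1]
```

For a section from 3 s to 5 s containing a topic boundary (score 0.8) at 4.5 s and a video change (score 0.9) at 5.0 s, the topic boundary wins because its weight is ten times larger.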
次に、このように抽出されたストーリー境界点データ（STB）の使用例について説明する。
図１０は、ストーリー境界点データ（STB）を使用する際のデータ分析装置２０の機能構成を例示したブロック図である。
【００４４】
この例では、コンテンツデータ記憶部２０ａには前述したデータの他、音声認識部２０ｃ（図４）で生成された音声認識結果データ（VOR）も記録されているものとする。
この例の場合、まず、制御部２０ｎは、コンテンツデータ記憶部２０ａ、記憶部２０ｅ及びストーリー境界点データ記憶部２０ｋにアクセスし、コンテンツデータ記憶部２０ａからコンテンツデータを、記憶部２０ｅから音声ポーズデータ（VOP）、音響ポーズデータ（BGP）及び音声認識結果データ（VOR）を、ストーリー境界点データ記憶部２０ｋからストーリー境界点データ（STB）を、それぞれ読み出し、これらの情報を用い、図１１に例示するストーリー構造表示画面２００を生成する。生成されたストーリー構造表示画面２００は、表示部２０ｍに送られ、表示部２０ｍはこれを表示する。
【００４５】
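As illustrated in FIG. 11, the story structure display screen 200 consists of a plurality of story segments 210 to 230 obtained by dividing the content at the story boundary points. Each story segment 210 to 230 consists of a story range display area 211 that shows the start time ("Start") and "Length" of the story; icons 212 that accept clicks for "Show story details", "Show metadata", "Play story", and "Play subtitles"; a summary display area 213 that shows a summary of the story ("Summary") and its keywords ("Topic"); and pause boundary frames 214a to 214h that show the first still image of the content and the first still image after each pause.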
【００４６】
この例の場合、ストーリー区分２１０〜２３０は、ストーリー境界点データ（STB）を用い、コンテンツをストーリー境界点で分割することによって決定され、ストーリー範囲表示部２１１に表示される「開始」は、ストーリー境界点データ（STB）が示す時刻（各ストーリー区分２１０〜２３０の開始時刻）によって、「長さ」は、ストーリー境界点データ（STB）が示す各ストーリー区分２１０〜２３０の開始時刻と終了時刻との差によって、それぞれ生成される。また、概略表示部２１３の「サマリー」及び「トピック」は、各ストーリー区分２１０〜２３０に対応する音声認識結果データ（VOR）を、各ストーリー区分２１０〜２３０ごとに加工することによって生成される。また、ポーズ境界画面２１４ａ〜２１４ｈは、音声ポーズデータ（VOP）の時刻（VOP1）１０１から、ポーズ長（VOP2）１０２だけ経過した時刻のコンテンツの静止画像、及び音響ポーズデータ（BGP）の時刻（BGP1）１３１から、ポーズ長（BGP2）１３２だけ経過した時刻のコンテンツの静止画像、を時系列順に並べることによって生成される。
【００４７】
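A user viewing the story structure display screen 200 clicks, by mouse input or the like from the input unit 20p, an icon 212 of the story segment 210 to 230 that the user wishes to view. This input is passed to the control unit 20n, and when the control unit 20n detects the click on the icon 212, it performs the corresponding processing, for example reading the content of the selected story segment from the content data storage unit 20a and sending it to the display unit 20m for display.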
なお、利用者が入力部２０ｐに入力したキーワードを用い、制御部２０ｎにおいて、音声認識結果データ（VOR）が示す文字列を検索し、そのキーワードを有するストーリー区分を表示部２０ｍに表示し、利用者により、表示部２０ｍに表示されたストーリー区分のコンテンツの閲覧を希望する旨の入力が入力部２０ｐに行われた場合に、そのストーリー区分のコンテンツを表示部２０ｍに表示させることとしてもよい。
【００４８】
以上説明したように、この形態では、コンテンツデータによって特定される文字列を、音声認識処理によって抽出し（ステップＳ３）、抽出された文字列を、意味的にまとまりのある単位に分割して話題境界点データ（VORB）を抽出し（ステップＳ４）、データによって特定されるコンテンツの物理的な変化点の時系列的な位置（信号変化点）を特定する信号変化点データ（VOP,BGC,BGP,VIC）を抽出し（ステップＳ２、Ｓ６、Ｓ８）、この話題境界点データ（VORB）と、信号変化点データ（VOP,BGC,BGP,VIC）とを用い、ストーリー境界点データ（STB）を抽出することとした。そのため、コンテンツから音声認識結果が得られなかった区間であっても、その他の信号変化点データ（VOP,BGC,BGP,VIC）を用いることにより、ストーリー構造を容易に抽出することが可能となる。
【００４９】
また、この形態では、話題境界点データ（VORB）だけではなく、信号変化点データ（VOP,BGC,BGP,VIC）をも用いてストーリー境界点を抽出することとしたため、抽出したストーリー境界点の正確性・的確性が向上する。
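Furthermore, in this embodiment, the story boundary candidate section extraction unit 20h (part of the story boundary extraction means) takes a search section having a time-series width as a story boundary candidate section when the scores of the topic boundary points and the scores of the signal change points belonging to that search section satisfy a predetermined condition (step S9), and the story boundary point extraction unit 20j (also part of the story boundary extraction means) takes, as the story boundary point, a topic boundary point or signal change point extracted for each story boundary candidate section from among those belonging to it (step S10). As a result, even when several topic boundary points and signal change points correspond to a single story boundary point of the content and do not coincide exactly (for example, because of errors inherent in the detection methods), the apparatus avoids judging every detected topic boundary point and signal change point to be a story boundary point. Story boundary points can therefore be extracted from multiple topic boundary points and signal change points accurately and practically.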
【００５０】
さらに、この形態では、話題境界点のスコア（VORB2）及び信号変化点のスコア（VOP2,BGC2,BGP2,VIC2）を、重み係数（W）で重み付けを行って合計した値が、所定のしきい値に達した場合に、この話題境界点或いは信号変化点が属する探索区間を、ストーリー境界候補区間とした（ステップＳ１１〜Ｓ２６）。これにより、話題境界点や信号変化点の検出信頼度をストーリー境界候補区間の検出に反映させ、さらに、重み係数（W）の調整により、話題境界点や信号変化点のストーリー境界候補区間検出時の寄与度を自由に設定することができる。その結果、適切なストーリー境界候補区間の検出を好適に行うことが可能となる。
【００５１】
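Also, in this embodiment, for each story boundary candidate section, the scores (VORB2, VOP2, BGC2, BGP2, VIC2) of the topic boundary points and signal change points belonging to that section are weighted with the weighting coefficients (W) and compared, and the topic boundary point or signal change point whose weighted score (SC_k) is the largest in the section is taken as the story boundary point (steps S31 to S42). The story boundary point can thus be determined from the most reliable data in the story boundary candidate section, so story boundary points are extracted appropriately.
Furthermore, in this embodiment, the topic boundary point score (VORB2) is generated using the score obtained during speech recognition, so the reliability of the speech recognition can be reflected in the extraction of story boundary points. This reduces the influence of speech recognition errors and allows appropriate story boundary points to be extracted.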
【００５２】
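Also, in this embodiment, the topic boundary point score (VORB2) is generated using the score obtained when extracting the topic boundary point data, so the reliability of topic boundary point extraction can be reflected in the extraction of story boundary points. This reduces the influence of topic boundary point extraction errors and allows appropriate story boundary points to be extracted.
Furthermore, in this embodiment, the signal change point scores (VOP2, BGC2, BGP2, VIC2) are generated using the scores obtained when extracting the signal change point data, so the reliability of signal change point extraction can be reflected in the extraction of story boundary points. This reduces the influence of signal change point extraction errors and allows appropriate story boundary points to be extracted.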
【００５３】
次に、この発明における第２の実施の形態について説明する。
この形態は第１の実施の形態の変形例であり、ストーリー境界候補区間データ抽出時（ステップＳ９）に用いる重み係数（第１の重み係数）と、ストーリー境界点データ抽出時（ステップＳ１０）に用いる重み係数（第２の重み係数）とが異なる係数である点が第１の実施の形態と異なる。以下では、第１の実施の形態と異なる点のみを説明し、第１の実施の形態と共通する事項については説明を省略する。なお、この形態で第１の実施の形態と同一の機能構成を図示する場合には、第１の実施の形態と同一の番号を用いる。
【００５４】
図１２は、図３に例示したハードウェア構成において、データ分析プログラムメモリ２６に記憶されたデータ分析プログラムを実行することによって構築される本形態のデータ分析装置３００の処理機能を例示したブロック図である。
この形態の構成と第１の実施の形態の構成との相違点は、第１の実施の形態の重み係数記憶部２０ｉの代わりに、ストーリー境界候補区間データ抽出時（ステップＳ９）に用いる重み係数（W）を記憶した境界候補区間重み係数記憶部３０１、及びストーリー境界点データ抽出時（ステップＳ１０）に用いる重み係数（W'）を記憶した境界点重み係数記憶部３０２を設けた点である。なお、境界候補区間重み係数記憶部３０１は、ストーリー境界候補区間抽出部２０ｈに重み係数（W）の提供が可能なように構成され、境界点重み係数記憶部３０２は、ストーリー境界点抽出部２０ｊに重み係数（W'）の提供が可能なように構成されている。
【００５５】
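FIG. 13 illustrates the structure of the data stored in the boundary point weighting coefficient storage unit 302.
As illustrated in FIG. 13, the weighting coefficient (W') 312 is a coefficient assigned to each event 311, such as the topic boundary point data (VORB). In this example, "0.2" is assigned to the topic boundary point data (VORB), "0.3" to the acoustic signal change point data (BGC), "0.3" to the video signal change point data (VIC), "0.1" to the speech pause data (VOP), and "0.1" to the acoustic pause data (BGP).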
【００５６】
なお、境界候補区間重み係数記憶部３０１に記憶された重み係数（W）は、図６に例示したものと同一とし、ここでは説明を省略する。
次に、この形態におけるデータ分析装置３００によって行われるデータ分析方法について説明する。なお、第１の実施の形態で説明したステップＳ１〜Ｓ９の処理（図７）は、第２の形態におけるデータ分析方法についても同様に行われ、その相違は、重み係数記憶部２０ｉ（図４）が境界候補区間重み係数記憶部３０１（図１２）に置き換わる点のみである。従って、ここでは、その説明を省略し、第１の実施の形態との相違点である図７のステップＳ１０の処理のみについて説明を行う。
【００５７】
図１４は、第２の実施の形態における図７のステップＳ１０の詳細を説明するためのフローチャートである。以下、このフローチャートに沿って、ストーリー境界点抽出部２０ｊ（図４）で行われるこの形態のステップＳ１０の処理の詳細を説明する。
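In this process, for each story boundary candidate section (the section from STS1 to STS2), the scores (VOP2, VORB2, BGC2, BGP2, VIC2) of the speech pause data (VOP) 100 and the other items belonging to that section are weighted with the weighting coefficients (W') in the boundary point weighting coefficient storage unit 302 and compared, and the time (VOP1, VORB1, BGC1, BGP1, VIC1) of the item whose weighted score is the largest in that section is taken as the story boundary point. The process illustrated in FIG. 14 is performed separately for each story boundary candidate section, so the subscript q of STS1 and STS2 is omitted below.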
【００５８】
この例の場合、まず、SC₁〜ＳＣ_m、及びｋの値を０にリセットし（ステップＳ５１）、STS1≦VORB1≦STS2を満たすか否かを判断する（ステップＳ５２）。ここで、STS1≦VORB1≦STS2を満たした場合、これを満たす全ての話題境界点データ（VORB）に対し、SC_k←VORB2＊0.2の演算を行い、その演算のたびにｋ←k+1の演算を行う（ステップＳ５３）。なお、ここで乗算される「0.2」は、前述のように割り当てられた重み係数（W'）である。この演算結果SC_kは、それに対応する話題境界点データ（VORB）に関連付けて記憶部２０ｅに記憶され、ステップＳ５４に進む。一方、STS1≦VORB1≦STS2を満たさなかった場合、この演算・記録を行わずステップＳ５４に進む。
【００５９】
次に、STS1≦BGC1≦STS2を満たすか否かを判断する（ステップＳ５４）。ここで、STS1≦BGC1≦STS2を満たした場合、これを満たす全ての音響信号変化点データ（BGC）に対し、SC_k←BGC2＊0.3の演算を行い、その演算のたびにｋ←k+1の演算を行う（ステップＳ５５）。なお、ここで乗算される「0.3」は、前述のように割り当てられた重み係数（W'）である。この演算結果SC_kは、それに対応する音響信号変化点データ（BGC）に関連付けて記憶部２０ｅに記憶され、ステップＳ５６に進む。一方、STS1≦BGC1≦STS2を満たさなかった場合、この演算・記録を行わずステップＳ５６に進む。
【００６０】
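Next, it is determined whether STS1≦VIC1≦STS2 is satisfied (step S56). If it is, the operation SC_k←VIC2*0.3 is performed for every item of video signal change point data (VIC) that satisfies it, and k←k+1 is performed after each such operation (step S57). The multiplier "0.3" is the weighting coefficient (W') assigned as described above. Each result SC_k is stored in the storage unit 20e in association with the corresponding video signal change point data (VIC), and the process proceeds to step S58. If STS1≦VIC1≦STS2 is not satisfied, the process proceeds to step S58 without performing this operation or recording.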
【００６１】
次に、STS1≦VOP1≦STS2を満たすか否かを判断する（ステップＳ５８）。ここで、STS1≦VOP1≦STS2を満たした場合、これを満たす全ての音声ポーズデータ（VOP）に対し、SC_k←(VOP2/3000)＊0.1の演算を行い、その演算のたびにｋ←k+1の演算を行う（ステップＳ５９）。なお、ここで乗算される「0.1」は、前述のように割り当てられた重み係数（W'）である。この演算結果SC_kは、それに対応する音声ポーズデータ（VOP）に関連付けて記憶部２０ｅに記憶され、ステップＳ６０に進む。一方、STS1≦VOP1≦STS2を満たさなかった場合、この演算・記録を行わずステップＳ６０に進む。
【００６２】
次に、STS1≦BGP1≦STS2を満たすか否かを判断する（ステップＳ６０）。ここで、STS1≦BGP1≦STS2を満たした場合、これを満たす全ての音響ポーズデータ（BGP）に対し、SC_k←(BGP2/1000)＊0.1の演算を行い、その演算のたびにｋ←k+1の演算を行う（ステップＳ６１）。なお、ここで乗算される「0.1」は、前述のように割り当てられた重み係数（W'）である。この演算結果SC_kは、それに対応する音響ポーズデータ（BGP）に関連付けて記憶部２０ｅに記憶され、ステップＳ６２に進む。一方、STS1≦BGP1≦STS2を満たさなかった場合、この演算・記録を行わずステップＳ６２に進む。
【００６３】
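After that, each score SC_p (0≦p≦k) stored in the storage unit 20e is read out, and the maximum among them is found with a sorting algorithm or the like. The time (VORB1, BGC1, VIC1, VOP1, BGP1) of the event corresponding to (associated with) that maximum SC_p is taken as the story boundary point data (STB), and its value is recorded in the story boundary point data storage unit 20k (step S62).
Thus, in this embodiment, the weighting coefficients (W) used when extracting the story boundary candidate section data (step S9; the first weighting coefficients) and the weighting coefficients (W') used when extracting the story boundary point data (step S10; the second weighting coefficients) are set separately. The two sets of coefficients can therefore differ, which allows the contribution of topic boundary points and signal change points to candidate section extraction and to story boundary point extraction to be tuned with greater freedom. For example, candidate sections can be extracted with emphasis on topic boundary points, and story boundaries with emphasis on video change points. As a result, story boundary points can be extracted more appropriately.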
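The effect of using separate coefficients can be illustrated with a small two-stage sketch: the first coefficients decide which search sections become candidates, and the second coefficients decide which event inside a candidate becomes the boundary. The values below follow those listed for FIG. 6 (W) and FIG. 13 (W'). The event tuples, the function names, and the removal of duplicate boundaries from overlapping sections are assumptions of this sketch.

```python
from typing import Dict, List, Sequence, Tuple

EventTuple = Tuple[str, float, float]  # (kind, time, score or pause length in msec)

W_FIRST: Dict[str, float] = {"VORB": 0.5, "BGC": 0.2, "VIC": 0.05, "VOP": 0.2, "BGP": 0.05}
W_SECOND: Dict[str, float] = {"VORB": 0.2, "BGC": 0.3, "VIC": 0.3, "VOP": 0.1, "BGP": 0.1}
PAUSE_NORM_MS: Dict[str, float] = {"VOP": 3000.0, "BGP": 1000.0}


def story_boundaries(events: Sequence[EventTuple], n: float,
                     w_first: Dict[str, float], w_second: Dict[str, float],
                     width: float = 2, step: float = 1,
                     threshold: float = 0.3) -> List[float]:
    """Candidate sections are chosen with w_first; the boundary in each with w_second."""
    boundaries: List[float] = []
    st1 = 0
    while True:
        inside = [(kind, time, value / PAUSE_NORM_MS.get(kind, 1.0))
                  for kind, time, value in events if st1 <= time <= st1 + width]
        if inside and sum(w_first[k] * s for k, _, s in inside) >= threshold:
            best = max(inside, key=lambda e: w_second[e[0]] * e[2])
            if best[1] not in boundaries:  # overlapping sections may agree
                boundaries.append(best[1])
        if st1 + width >= n:
            break
        st1 += step
    return boundaries
```

With a topic boundary (score 0.8) at 4.5 s and a video change (score 0.9) at 5.0 s, using W for both stages selects 4.5 s, whereas using W for the first stage and W' for the second selects the video change at 5.0 s.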
【００６４】
次に、この発明における第３の実施の形態について説明する。
この形態は第１の実施の形態の変形例であり、ストーリー境界点データ抽出（ステップＳ１０）を、重み係数で重み付けしたスコアを比較するのではなく、イベント自体に付与された優先順位に従って、ストーリー境界候補区間からストーリー境界点を抽出する点が第１の実施の形態と異なる。以下では、第１の実施の形態と異なる点のみを説明し、第１の実施の形態と共通する事項については説明を省略する。なお、この形態で第１の実施の形態と同一の機能構成を図示する場合には、第１の実施の形態と同一の番号を用いる。
【００６５】
図１５は、図３に例示したハードウェア構成において、データ分析プログラムメモリ２６に記憶されたデータ分析プログラムを実行することによって構築される本形態のデータ分析装置４００の処理機能を例示したブロック図である。
この形態の構成と第１の実施の形態の構成との相違点は、第１の実施の形態の重み係数記憶部２０ｉが、ストーリー境界点抽出部２０ｊに重み係数（W）を供給する構成とはなっていない点、及び各イベントの優先順位を示すイベント順位（R）を記憶したイベント順位記憶部４０１を設けた点である。なお、このイベント順位記憶部４０１は、ストーリー境界点抽出部２０ｊに、イベント順位（R）の情報の提供が可能なように構成されている。
【００６６】
図１６は、イベント順位記憶部４０１に記憶されたデータの構成を例示している。
図１６に例示するように、イベント順位記憶部４０１には、映像信号変化点データ（ＶＩＣ）等のイベント４１１にそれぞれ対応付けられた、イベント順位（R）が記憶されている。この例の場合、映像信号変化点データ（VIC）に対してイベント順位「１」が、音響信号変化点データ（BGC）に対してイベント順位「２」が、話題境界点データ（VORB）に対してイベント順位「３」が、それぞれ割り当てられている。なお、前述のように、「イベント順位」は、各イベントの優先順位を示し、同じストーリー境界候補区間に複数のイベントが存在した場合、その中で最もイベント順位が高いイベントの時刻（VORB1,BGC1,VIC1）が、ストーリー境界点として選択される。
【００６７】
次に、この形態におけるデータ分析装置４００によって行われるデータ分析方法について説明する。なお、第１の実施の形態で説明したステップＳ１〜Ｓ９の処理（図７）は、第３の形態におけるデータ分析方法についても同様に行われるため、ここでは、その説明を省略し、第１の実施の形態との相違点である図７のステップＳ１０の処理のみについて説明を行う。
図１７は、第３の実施の形態における図７のステップＳ１０の詳細を説明するためのフローチャートである。以下、このフローチャートに沿って、ストーリー境界点抽出部２０ｊ（図４）で行われるこの形態のステップＳ１０の処理の詳細を説明する。
【００６８】
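In this process, for each story boundary candidate section (the section from STS1 to STS2), the event with the highest priority is selected from the events belonging to that section, such as the video signal change point data (VIC), and the time of that event (VIC1, BGC1, VORB1) is taken as the story boundary point. The process illustrated in FIG. 17 is performed separately for each story boundary candidate section, so the subscript q of STS1 and STS2 is omitted below.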
この例の場合、まず、SC₁〜ＳＣ_mを０にリセットし（ステップＳ７１）、イベント順位記憶部４０１からイベント順位（R）４１２を取得し、このイベント順位（R）４１２が「１」であるイベント、すなわち、映像信号変化点データ（VIC）の時刻（VIC1）が、STS1≦VIC1≦STS2を満たすか否かを判断する（ステップＳ７２）。ここで、STS1≦VIC1≦STS2を満たした場合、これを満たす映像信号変化点データ（VIC）のうち、変化スコア（VIC2）が最大の映像信号変化点データ（VIC）の時刻（VIC1）をストーリー境界点データ（STB）とし（ステップＳ７３）、処理を終了する。
【００６９】
一方、STS1≦VIC1≦STS2を満たさない場合、イベント順位記憶部４０１からイベント順位（R）４１２を取得し、このイベント順位（R）４１２が「２」であるイベント、すなわち、音響信号変化点データ（BGC）の時刻（BGC1）が、STS1≦BGC1≦STS2を満たすか否かを判断する（ステップＳ７４）。ここで、STS1≦BGC1≦STS2を満たした場合、これを満たす音響信号変化点データ（BGC）のうち、変化スコア（BGC2）が最大の音響信号変化点データ（BGC）の時刻（BGC1）をストーリー境界点データ（STB）とし（ステップＳ７５）、処理を終了する。
一方、STS1≦BGC1≦STS2を満たさない場合、イベント順位記憶部４０１からイベント順位（R）４１２を取得し、このイベント順位（R）４１２が「３」であるイベント、すなわち、話題境界点データ（VORB）の時刻（VORB1）が、STS1≦VORB1≦STS2を満たすか否かを判断する（ステップＳ７６）。ここで、STS1≦VORB1≦STS2を満たした場合、これを満たす話題境界点データ（VORB）のうち、境界スコア（VORB2）が最大の話題境界点データ（VORB）の時刻（VORB1）をストーリー境界点データ（STB）とし（ステップＳ７７）、処理を終了する。
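Steps S72 to S77 can be sketched as a walk down the event ranks: the first event type (in rank order) that has at least one event in the candidate section supplies the boundary, and within that type the event with the largest score is chosen. The ranks follow FIG. 16; the tuple layout and function name are illustrative assumptions.

```python
from typing import Dict, Optional, Sequence, Tuple

EventTuple = Tuple[str, float, float]  # (kind, time, score)

# Event ranks (R) as described for FIG. 16: 1 is the highest priority.
EVENT_RANK: Dict[str, int] = {"VIC": 1, "BGC": 2, "VORB": 3}


def boundary_by_priority(events: Sequence[EventTuple],
                         section: Tuple[float, float],
                         rank: Dict[str, int] = EVENT_RANK) -> Optional[float]:
    """Return the time of the best-scoring event of the highest-ranked type present."""
    start, end = section
    for kind in sorted(rank, key=lambda name: rank[name]):
        hits = [(score, time) for k, time, score in events
                if k == kind and start <= time <= end]
        if hits:
            return max(hits, key=lambda hit: hit[0])[1]
    return None
```

A video change point in the section is therefore chosen even when a topic boundary in the same section has a higher score, which is the intended difference from the weighted comparison of the first embodiment.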
【００７０】
このように、本形態では、イベント順位（R）４１２によってイベントの優先順位を定め、ストーリー境界候補区間に属するイベントから、最も優先順位が高いイベントを選択し、そのイベントの時刻（VIC1,BGC1,VORB1）をストーリー境界点とした。そのため、ストーリー境界点としてより適切なイベントを優先的に選択することが可能となり、ストーリー境界点の抽出処理を最適化することができる。
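The present invention is not limited to the first to third embodiments described above. For example, in these embodiments speech recognition is used as the pattern recognition processing to extract character strings from the content, but character recognition may be used instead to extract character strings from text such as captions (telops) appearing in the video of the content, and those character strings may be used to extract topic boundary points.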
【００７１】
また、このテロップが現れる時点を映像信号の変化点として抽出し、それを信号変化点としてストーリー境界点の抽出に利用することとしてもよい。
さらに、この形態では、コンピュータ上で所定のプログラムを実行させることにより、データ分析装置を構成することとしたが、これらの処理内容の少なくとも一部を電子回路によってハードウェア的に実現することとしてもよい。
また、第１〜３の実施の形態で述べたように、上記のデータ分析装置の処理機能は、コンピュータによって実現することができる。この場合、データ分析装置が有すべき機能の処理内容はプログラムによって記述され、このプログラムをコンピュータで実行することにより、上記処理機能をコンピュータ上で実現することができる。
【００７２】
また、この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、例えば、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等どのようなものでもよいが、具体的には、例えば、磁気記録装置として、ハードディスク装置、フレキシブルディスク、磁気テープ等を、光ディスクとして、ＤＶＤ（Digital Versatile Disc）、ＤＶＤ−ＲＡＭ（Random Access Memory）、ＣＤ−ＲＯＭ（Compact Disc Read Only Memory）、ＣＤ−Ｒ（Recordable）／ＲＷ（ReWritable）等を、光磁気記録媒体として、ＭＯ（Magneto-Optical disc）等を用いることができる。
【００７３】
また、このプログラムの流通は、例えば、そのプログラムを記録したＤＶＤ、ＣＤ−ＲＯＭ等の可搬型記録媒体を販売、譲渡、貸与等することによって行う。さらに、このプログラムをサーバコンピュータの記憶装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することにより、このプログラムを流通させる構成としてもよい。
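A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and performs processing according to it. As another mode of execution, the computer may read the program directly from the portable recording medium and perform processing according to it, or may perform processing according to the received program each time a program is transferred to it from the server computer. The processing described above may also be performed by a so-called ASP (Application Service Provider) service, in which no program is transferred from the server computer to this computer and the processing functions are realized solely through execution instructions and retrieval of results.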
【００７４】
なお、上記におけるプログラムとは、電子計算機に対する指令であって、一の結果を得ることができるように組合されたものをいい、その他電子計算機による処理の用に供する情報であってプログラムに準ずるものをも含むものとする。
【００７５】
【発明の効果】
以上説明したようにこの発明では、コンテンツデータによって特定される文字列を、パターン認識処理によって抽出し、この抽出された文字列を、意味的にまとまりのある単位に分割し、その分割点の時系列的な位置（話題境界点）を特定する話題境界点データを抽出し、また、このコンテンツデータによって特定されるコンテンツの物理的な変化点の時系列的な位置（信号変化点）を特定する信号変化点データを抽出する。そして、この抽出された話題境界点データと、信号変化点データとを用い、コンテンツの内容的な境界点の時系列的な位置（ストーリー境界点）を特定することとした。そのため、コンテンツから音声認識結果が得られない区間においても、ストーリー構造を容易に抽出することが可能となる。
【図面の簡単な説明】
【図１】データ分析装置の概略を例示した概念図。
【図２】データ分析方法を説明するための概念図。
【図３】データ分析装置のハードウェア構成を例示したブロック図。
【図４】データ分析装置の処理機能を例示したブロック図。
【図５】記憶部に記憶されるデータの構成を例示した図。
【図６】重み係数記憶部に記憶されたデータの構成を例示した図。
【図７】データ分析装置によって行われるデータ分析方法を説明するためのフローチャート。
【図８】図７におけるステップＳ９の処理の詳細を説明するためのフローチャート。
【図９】図７におけるステップＳ１０の詳細を説明するためのフローチャート。
【図１０】ストーリー境界点データを使用する際のデータ分析装置の機能構成を例示したブロック図。
【図１１】ストーリー構造表示画面を例示した図。
【図１２】データ分析装置の処理機能を例示したブロック図。
【図１３】境界候補区間重み係数記憶部に記憶されたデータの構成を例示した図。
【図１４】図７のステップＳ１０の詳細を説明するためのフローチャート。
【図１５】データ分析装置の処理機能を例示したブロック図。
【図１６】イベント順位記憶部に記憶されたデータの構成を例示した図。
【図１７】図７のステップＳ１０の詳細を説明するためのフローチャート。
【符号の説明】
１、２０、３００、４００データ分析装置
３文字列抽出手段
４話題境界点抽出手段
５信号変化点抽出手段
６ストーリー境界点抽出手段
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a data analysis apparatus, a data analysis method, and a data analysis program for analyzing the story structure of content data by means of a computer system, and in particular to a data analysis apparatus, data analysis method, and data analysis program for analyzing the story structure of multimedia content data consisting of video/audio data and the like.
[0002]
[Prior art]
Conventionally, when multimedia content consisting of video and audio is to be made searchable, it has been common to divide the content manually into units of coherent information (hereinafter referred to as the "story structure") and to describe that division information in the content.
However, extracting the story structure manually in this way requires a great deal of labor, and generating story structures for a large amount of content has been extremely difficult.
[0003]
Under these circumstances, a video data search support method has been proposed that performs speech recognition on the acoustic signal in the content and automatically extracts the story structure based on the extracted character string (see Patent Document 1). As a method of dividing this character string into semantically coherent units, the method shown in Patent Document 2, for example, has been disclosed.
[0004]
[Patent Document 1]
JP-A-5-342263
[Patent Document 2]
JP 2002-342324 A
[0005]
[Problems to be solved by the invention]
However, because the method disclosed in Patent Document 1 performs speech recognition on the acoustic signal in the content and extracts the story structure based on the extracted character string, it cannot extract the story structure in sections where no speech recognition result is obtained from the content and no character string can be extracted.
Moreover, manually extracting the story structure in sections where no speech recognition result was obtained requires a great deal of labor.
The present invention has been made in view of these points, and its object is to provide a data analysis apparatus that can easily extract the story structure even in sections where no speech recognition result is obtained from the content.
[0006]
Another object of the present invention is to provide a data analysis method capable of easily extracting a story structure even in a section where a speech recognition result cannot be obtained from content.
A further object of the present invention is to provide a data analysis program for causing a computer to execute a function that can easily extract the story structure even in sections where no speech recognition result is obtained from the content.
[0007]
[Means for Solving the Problems]
To solve the above problems, the present invention extracts a character string specified by the content data by pattern recognition processing, divides the extracted character string into semantically coherent units, and extracts topic boundary point data that specifies the time-series positions of the division points (topic boundary points). It also extracts signal change point data that specifies the time-series positions of physical change points of the content specified by the content data (signal change points). Then, using the extracted topic boundary point data and signal change point data, it specifies the time-series positions of the boundary points in the substance of the content (story boundary points).
[0008]
In the present invention, preferably, when the scores of the topic boundary points belonging to a search section having a time-series width and the scores of the signal change points belonging to that search section satisfy a predetermined condition, the search section is taken as a story boundary candidate section, and a topic boundary point or signal change point extracted for each story boundary candidate section from among those belonging to it is taken as the story boundary point.
Also preferably, when the sum of the topic boundary point scores and signal change point scores, each weighted by a first weighting coefficient, reaches a predetermined threshold, the search section to which those topic boundary points or signal change points belong is taken as a story boundary candidate section.
[0009]
Further, in the present invention, preferably, for each story boundary candidate section, the scores of the topic boundary points and signal change points belonging to that section are weighted with a second weighting coefficient and compared, and the topic boundary point or signal change point whose weighted score is the largest in the section is taken as the story boundary point.
In the present invention, it is preferable that the second weighting coefficient is different from the first weighting coefficient.
Furthermore, in the present invention, preferably, the topic boundary point or signal change point with the highest priority is set as the story boundary point from the topic boundary points and signal change points belonging to the story boundary candidate section.
[0010]
In the present invention, it is preferable that the topic boundary point score is a score generated using the score in the pattern recognition process.
Further, in the present invention, preferably, the score of the topic boundary point is a score generated using the score when extracting the topic boundary point data.
In the present invention, it is preferable that the signal change point score is a score generated by using the score when the signal change point data is extracted.
[0011]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, a first embodiment of the present invention will be described with reference to the drawings. In the following, the outline of this embodiment will be described first, and then the details will be described.
FIG. 1 is a conceptual diagram illustrating an outline of the data analysis apparatus 1 in this embodiment, and FIG. 2 is a conceptual diagram for explaining the data analysis method. The configuration of the data analysis apparatus 1 of this example and the outline of the data analysis method will be described below with reference to FIGS. 1 and 2.
[0012]
The data analysis apparatus 1 is configured, for example, by causing a personal computer to execute a predetermined program. The content data storage means 2 stores content data 10 that consists of video and audio and has a time series (t = 0 to n in FIG. 2). When data analysis is performed by the data analysis apparatus 1 of this example, first, the content data 10 is read from the content data storage means 2 by the character string extraction means 3, and the character string specified by the content data is extracted by pattern recognition processing. Here, “pattern recognition processing” refers to processing that performs pattern recognition on audio or video and extracts character strings (for example, voice recognition processing, character recognition processing, etc.). The extracted character string information is sent to the topic boundary point extracting means 4, where the extracted character string is divided into semantically coherent units, and topic boundary point data 11 specifying the time-series positions of the division points (topic boundary points 11a to 11e) is extracted.
[0013]
In this example, the content data 10 is also read from the content data storage means 2 by the signal change point extraction means 5, and signal change point data (acoustic signal change point data 12 and video signal change point data 13) specifying the time-series positions (acoustic signal change points 12a to 12i and video signal change points 13a to 13i) of the physical change points of the content specified by the content data 10 is extracted.
The extracted topic boundary point data 11 and signal change point data (acoustic signal change point data 12 and video signal change point data 13) are sent to the story boundary point extraction means 6, where the topic boundary point data 11 and the signal change point data are used to specify the time-series positions (story boundary points 15a to 15d) of boundary points in the substance of the content.
[0014]
Although details will be described later, in this example, the story boundary points 15a to 15d are specified as follows.
First, the search section 16, which has a time-series width (d) and covers the interval (t to t + d) at a certain time t, is moved along the time series. When the scores of the topic boundary points 11a to 11e belonging to the search section 16 and the scores of the signal change points (acoustic signal change points 12a to 12i and video signal change points 13a to 13i) belonging to the search section 16 satisfy a predetermined condition, the search section 16 is set as one of the story boundary candidate sections 14a to 14d (story boundary candidate section data 14). Then, for each of the story boundary candidate sections 14a to 14d, a topic boundary point or signal change point belonging to that section is extracted and set as one of the story boundary points 15a to 15d (story boundary point data 15).
[0015]
As described above, in the example of this embodiment, the story boundary points 15a to 15d are extracted using both the voice recognition result and the extracted change points of the acoustic signal or the video signal. The story structure can therefore be extracted easily even in a section for which no voice recognition result was obtained from the content.
Next, details of this embodiment will be described.
FIG. 3 is a block diagram illustrating a hardware configuration of the data analysis apparatus 20 in this embodiment.
As illustrated in FIG. 3, the data analysis device 20 includes a content data storage unit 21 that stores content data, a CPU (Central Processing Unit) 22, an input unit 23 such as a keyboard and a mouse, a story boundary point data storage unit 24 that stores story boundary point data, a storage unit 25 that stores various data, a data analysis program memory 26 that stores a data analysis program, a display unit 27 such as a liquid crystal display, and a bus 28 that connects these units so that they can exchange data.
[0016]
FIG. 4 is a block diagram illustrating the processing functions of the data analysis apparatus 20 constructed by executing the data analysis program stored in the data analysis program memory 26 on the hardware configuration illustrated in FIG. 3. FIG. 5 illustrates the configuration of data stored in the storage unit 20e, and FIG. 6 illustrates the configuration of data stored in the weight coefficient storage unit 20i. Furthermore, FIG. 7 is a flowchart for explaining a data analysis method performed by the data analysis apparatus 20.
Hereinafter, the functional configuration of the data analysis apparatus 20 in this embodiment and its data analysis method will be described with reference to these figures. In the following explanation, one or more items of voice pause data (VOP), voice recognition result data (VOR), topic boundary point data (VORB), acoustic signal change point data (BGC), acoustic pause data (BGP), and video signal change point data (VIC) are extracted and processed from each item of voice data (VO), acoustic data (BG), or video data (VI); the subscripts that distinguish these individual items of data are omitted (the same applies to the second and third embodiments).
[0017]
First, voice data (VO) is extracted from the content data storage unit 20a (step S1), and this voice data (VO) is sent to the voice recognition unit 20c. Next, the voice recognition unit 20c extracts voice pause data (VOP) 100 from this voice data (VO) (step S2), and further extracts, by voice recognition processing, the character string of the speech specified by the voice data (VO) together with its voice recognition score (voice recognition result data (VOR)) (step S3).
Here, a “pause” means that a silent state (for example, a state where the sum of squares of the amplitude of the sound information is equal to or less than a predetermined threshold) continues for a predetermined time or more (in this example, 0.5%). The voice pause data (VOP) 100 in this example consists of the time (VOP1) 101 of a voice pause portion specified by the voice data (VO) and the pause length (VOP2) 102, which is its duration (msec). Note that “time” means a time-series position in content having a time series, and is a concept that includes not only the playback time measured from the beginning of the content but also a count number indicating the time-series position in the content.
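The pause definition above can be sketched in code. The following is a minimal illustration, not the method of the embodiment: it assumes the sound has already been cut into fixed-length frames with one energy value (sum of squared amplitudes) per frame, and the names `detect_pauses`, `frame_ms`, `energy_threshold`, and `min_pause_ms` are hypothetical. Each returned pair corresponds to a time (VOP1) and a pause length (VOP2) in msec.

```python
def detect_pauses(frame_energy, frame_ms, energy_threshold, min_pause_ms):
    """Return (start_ms, length_ms) for silent runs of at least min_pause_ms.

    frame_energy: per-frame sum of squared amplitudes (assumed precomputed).
    """
    pauses = []
    run_start = None
    # A non-silent sentinel frame closes a silent run that reaches the end.
    for i, energy in enumerate(list(frame_energy) + [float("inf")]):
        silent = energy <= energy_threshold
        if silent and run_start is None:
            run_start = i
        elif not silent and run_start is not None:
            length_ms = (i - run_start) * frame_ms
            if length_ms >= min_pause_ms:
                pauses.append((run_start * frame_ms, length_ms))
            run_start = None
    return pauses


# Hypothetical values: 250 ms frames, pauses must last at least 500 ms.
energy = [5, 0, 0, 0, 5, 0, 5, 0, 0, 0, 0]
pauses = detect_pauses(energy, frame_ms=250, energy_threshold=1, min_pause_ms=500)
```

In this example the single silent frame at index 5 is too short to count as a pause, while the two longer silent runs are reported.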
[0018]
In addition, the “speech recognition score” in this example is composed of an acoustic score indicating the degree of match with the acoustic model at the time of speech recognition, a language score indicating the degree of match with a language model that represents how sentence-like the sequence of extracted character strings is, and a reliability score of the speech recognition result (for example, a value based on the number of competing candidates and the score difference between the selected word and the competing candidates). The larger the value of this “speech recognition score”, the more accurate the speech recognition result.
In this example, the voice pause data (VOP) extracted by the voice recognition unit 20c is sent to the storage unit 20e and stored there (FIG. 5), and is also sent, together with the voice recognition result data (VOR), to the topic boundary point extraction unit 20d.
[0019]
The topic boundary point extraction unit 20d, having received this information, extracts topic boundary point data (VORB) 110 based on it (step S4), and sends the extracted topic boundary point data (VORB) 110 to the storage unit 20e, where it is recorded (FIG. 5).
Note that the topic boundary point data (VORB) 110 in this example consists of the time (VORB1) 111 of a division point obtained when the character string of the speech recognition result data (VOR) is divided into semantically coherent units, and the boundary score (VORB2) 112 (topic boundary point score) given when the character string is divided into semantically coherent units.
[0020]
In addition, the topic boundary point data (VORB) 110 is extracted using, for example, the method described in JP-A-2002-342324, with the “text unit of content” delimited by the times (VOP1) 101 of the voice pause data (VOP) 100 as the minimum unit of division. That is, first, a word vector corresponding to each word included in the character string of the speech recognition result data (VOR) is acquired by searching a concept base in which word vectors expressing the meanings of words are stored. These word vectors are set so that semantically similar words are close to each other and semantically dissimilar words are far apart. Next, a word string, which is a set of a predetermined number of words, is taken before and after each word boundary delimited by a time (VOP1) 101 of the voice pause data (VOP) 100, and the word string cohesion degree (a similarity measure or distance measure between the preceding and following word strings) is calculated from the word vectors of the words constituting each word string. For example, the similarity or distance between the preceding and following word strings is obtained from the sums or centroids of their word vectors. If the word string cohesion degree is a similarity measure, the word boundary at which it is minimal is recognized as a topic boundary point.
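As a rough illustration of the cohesion calculation described above, the sketch below takes the centroid of the word vectors on each side of a candidate word boundary and uses cosine similarity as the similarity measure. This is an assumption made for illustration: the cited publication may use a different measure, and the function names, the toy two-dimensional vectors, and the choice of a single global minimum are all hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cohesion_at_boundaries(word_vectors, boundaries, window):
    """Map each boundary index b (between word b-1 and word b) to its cohesion."""
    result = {}
    for b in boundaries:
        before = word_vectors[max(0, b - window):b]
        after = word_vectors[b:b + window]
        if before and after:
            result[b] = cosine(centroid(before), centroid(after))
    return result


def topic_boundary(word_vectors, boundaries, window):
    """Pick the boundary with minimal cohesion (similarity-measure case)."""
    scores = cohesion_at_boundaries(word_vectors, boundaries, window)
    return min(scores, key=scores.get)


# Toy vectors: three words about one topic followed by three about another.
vectors = [[1, 0], [1, 0.1], [0.9, 0], [0, 1], [0.1, 1], [0, 0.9]]
best = topic_boundary(vectors, boundaries=[2, 3, 4], window=2)
```

With these toy vectors the cohesion is lowest at the boundary between the third and fourth words, which is where the vocabulary changes.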
[0021]
Further, in this example, the boundary score (VORB2) 112 is a score generated using the score (word string cohesion degree) obtained when extracting the topic boundary point data (VORB) 110. When the word string cohesion degree is a similarity measure, the boundary score becomes larger as the word string cohesion degree becomes larger; when the word string cohesion degree is a distance measure, the boundary score becomes smaller as the word string cohesion degree becomes larger.
In addition, the voice recognition score may be reflected in the boundary score (VORB2) 112, either by excluding words whose “voice recognition score” at the time of speech recognition is not more than a predetermined value from evaluation when extracting the topic boundary point data (VORB) 110, or by weighting each word according to its “voice recognition score” when extracting the topic boundary point data (VORB) 110. In this case, the boundary score (VORB2) 112 can be regarded as a score generated using the score obtained in the pattern recognition processing.
[0022]
Next, in this example, the acoustic data (BG) is extracted from the content data storage unit 20a (step S5), and this acoustic data (BG) is sent to the acoustic signal change point extraction unit 20f. The acoustic signal change point extraction unit 20f extracts the acoustic signal change point data (BGC) 120 and the acoustic pause data (BGP) 130 (step S6) and sends them to the storage unit 20e, where they are stored (FIG. 5). Here, the acoustic signal change point data (BGC) 120 consists of the time (BGC1) 121 of a physical change point of the acoustic signal included in the acoustic data (BG) and the change score (BGC2) 122 given when this change point is extracted. The acoustic pause data (BGP) 130 consists of the time (BGP1) 131 of an acoustic pause portion (the start or end of speech or music, etc.) specified by the acoustic data (BG) and the pause length (BGP2) 132, which is its duration (msec).
[0023]
For example, the following method is used to extract the time (BGC1) of a physical change point of the acoustic signal. First, a model representing a voice section and a model representing a music section are learned using data of voice sections and music sections. These models are then applied to score the input signal (the score indicating the degree of match with each model), and a change point is a point at which the magnitude relationship between the score given for matching the voice section model (voice section score) and the score given for matching the music section model (music section score) is reversed. At this time, the change score (BGC2) is generated using, for example, the difference between the “voice section score” and the “music section score” at the time (BGC1), so that the value of the change score (BGC2) increases as this difference increases.
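The reversal rule described above can be illustrated with a short sketch. It assumes that per-frame voice section scores and music section scores are already available from the two learned models (the model training itself is not shown), and it uses the absolute score difference at the reversal frame as the change score; the function name and the handling of ties are assumptions made here.

```python
def acoustic_change_points(times, speech_scores, music_scores):
    """Return (time, change_score) wherever the speech/music ordering flips."""
    points = []
    prev_sign = None
    for t, s, m in zip(times, speech_scores, music_scores):
        diff = s - m
        sign = (diff > 0) - (diff < 0)
        if sign == 0:
            continue  # tie: ordering undecided, wait for the next frame
        if prev_sign is not None and sign != prev_sign:
            # Larger score difference at the flip gives a larger change score.
            points.append((t, abs(diff)))
        prev_sign = sign
    return points


# Hypothetical per-frame scores: speech dominates, then music, then speech.
times = [0, 1, 2, 3, 4]
speech = [9, 8, 3, 2, 7]
music = [1, 2, 6, 9, 1]
changes = acoustic_change_points(times, speech, music)
```

Here the ordering flips at times 2 and 4, and each flip is reported with the size of the score gap at that frame.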
[0024]
Further, for example, the following method is used for learning a model representing a voice section. When speech is present, band-like spectral peaks can be observed at, or close to, integer multiples of a frequency along the frequency axis. Therefore, a comb filter having an appropriate tooth spacing in the frequency direction is prepared, and the sum of the spectral power at the tips of the comb teeth is obtained while changing the tooth spacing or shifting the comb in the frequency direction. When speech is present and harmonics are therefore present, this sum of spectral power becomes large, so a region where the sum of spectral power exceeds a predetermined threshold is learned as a model representing a voice section (see, for example, JP-A-8-177991).
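The comb-filter idea above can be sketched as follows. This is only an illustration under simplifying assumptions: the spectrum is a plain list of power values per frequency bin, the comb teeth sit at integer multiples of a spacing measured in bins, and the names `comb_power`, `best_comb_power`, and `is_speech_frame` are hypothetical rather than taken from the cited publication.

```python
def comb_power(spectrum, spacing, offset=0):
    """Sum power at bins offset+spacing, offset+2*spacing, ... (comb teeth)."""
    return sum(spectrum[k] for k in range(offset + spacing, len(spectrum), spacing))


def best_comb_power(spectrum, spacings, offsets=(0,)):
    """Largest comb sum over the candidate tooth spacings and shifts."""
    return max(comb_power(spectrum, s, o) for s in spacings for o in offsets)


def is_speech_frame(spectrum, spacings, threshold):
    """Treat a frame as speech if some comb picks up enough harmonic power."""
    return best_comb_power(spectrum, spacings) > threshold


# Toy spectrum with harmonic peaks every 3 bins, and a flat one for contrast.
harmonic = [10 if k % 3 == 0 and k > 0 else 1 for k in range(16)]
flat = [1] * 16
harmonic_is_speech = is_speech_frame(harmonic, spacings=[3, 4, 5], threshold=30)
flat_is_speech = is_speech_frame(flat, spacings=[3, 4, 5], threshold=30)
```

The comb with a spacing of 3 bins lines up with every peak of the harmonic spectrum, so its sum is much larger than for the flat spectrum.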
[0025]
For example, the following method is used for learning a model representing a music section. The frequency spectrum of the sound information is calculated, its cepstrum is obtained, the average duration of trajectories that do not fluctuate in the frequency direction of the cepstrum is calculated, and a region where this average duration exceeds a predetermined threshold is learned as a model representing a music section (see, for example, Japanese Patent Application Laid-Open No. 8-179791).
For the acoustic pause data (BGP) 130, for example, a time at which the sum of the squares of the amplitudes of the sound information becomes equal to or less than a predetermined threshold is taken as the time (BGP1) 131, and the duration of that state is taken as the pause length (BGP2) 132.
[0026]
Next, in this example, video data (VI) is extracted from the content data storage unit 20a (step S7), and this video data (VI) is sent to the video signal change point extraction unit 20g. The video signal change point extraction unit 20g extracts the video signal change point data (VIC) 140 (step S8) and sends it to the storage unit 20e, where it is recorded (FIG. 5). Here, the video signal change point data (VIC) 140 consists of the time (VIC1) 141 of a physical change point of the video signal included in the video data (VI) and the change score (VIC2) 142 given when this change point is extracted.
[0027]
In addition, as a method of extracting the time (VIC1) 141 of a physical change point of the video signal included in the video data (VI), there is, for example, a method in which the difference between two temporally adjacent images I_t and I_(t-1) in the captured series of video data (VI) is calculated, and when the sum of the absolute values of this difference over the entire screen (inter-frame difference D(t)) exceeds a predetermined threshold value, this t is set as the change point time (VIC1) 141 (Otsuji, Tonomura, Ohba: “Browsing of moving images using luminance information”, IEICE Technical Report, IE90-103, 1991). At this time, the change score (VIC2) 142 is generated using, for example, the inter-frame difference D(t) (corresponding to a score obtained when extracting signal change point data).
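The inter-frame difference method above lends itself to a small worked sketch. It assumes each frame is a two-dimensional list of luminance values of the same size, uses the frame index as the time t, and reports D(t) directly as the change score; the function names and the toy frames are illustrative assumptions.

```python
def interframe_difference(prev_frame, frame):
    """D(t): sum over the whole screen of |I_t - I_(t-1)|."""
    return sum(
        abs(cur - prev)
        for row_prev, row_cur in zip(prev_frame, frame)
        for prev, cur in zip(row_prev, row_cur)
    )


def video_change_points(frames, threshold):
    """Return (t, D(t)) for every frame index t where D(t) exceeds threshold."""
    points = []
    for t in range(1, len(frames)):
        d = interframe_difference(frames[t - 1], frames[t])
        if d > threshold:
            points.append((t, d))
    return points


# Toy 2x2 luminance frames: a small flicker at t=1, a hard cut at t=2.
frames = [
    [[0, 0], [0, 0]],
    [[0, 1], [0, 0]],
    [[9, 9], [9, 9]],
    [[9, 9], [9, 9]],
]
cuts = video_change_points(frames, threshold=10)
```

Only the cut at t = 2 exceeds the threshold; the one-pixel flicker at t = 1 does not.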
[0028]
Also, instead of the inter-frame difference, the pixel change area, the luminance histogram difference, the block-wise color correlation, the χ² test statistic, or the like may be used as D(t), and when this D(t) exceeds a predetermined threshold value, this t may be set as the change point time (VIC1) 141 (Otsuji, Tonomura: “Examination of automatic video cut detection method”, Television Society Technical Report, Vol.16, No.43, pp.7-12). At this time, the change score (VIC2) 142 is generated using, for example, the pixel change area, the luminance histogram difference, the block-wise color correlation, the χ² test statistic, or the like (corresponding to a score obtained when extracting signal change point data).
Further, instead of thresholding D(t) as it is, the time (VIC1) 141 of the change point may be obtained by thresholding the result of applying various temporal filters to D(t) (see K. Otsuji and Y. Tonomura: “Projection Detecting Filter for Video Cut Detection”, Proc. of ACM Multimedia 93, 1993, pp. 251-257).
[0029]
The voice pause data (VOP) 100, the topic boundary point data (VORB) 110, the acoustic signal change point data (BGC) 120, the acoustic pause data (BGP) 130, and the video signal change point data (VIC) 140 stored in the storage unit 20e as described above are sent to the story boundary candidate section extraction unit 20h. The story boundary candidate section extraction unit 20h extracts the story boundary candidate section data (STS) 150 using this information (VOP, VORB, BGC, BGP, VIC) and the weight coefficients (W) 162 read from the weight coefficient storage unit 20i (step S9). Details of the process in step S9 will be described later. As illustrated in FIG. 6, a weight coefficient (W) 162 is a coefficient assigned to each event 161 such as topic boundary point data (VORB). In this example, “0.5” is assigned to “topic boundary point data (VORB)”, “0.2” to “acoustic signal change point data (BGC)”, “0.05” to “video signal change point data (VIC)”, “0.05” to “voice pause data (VOP)”, and “0.05” to “acoustic pause data (BGP)”.
[0030]
The story boundary candidate section data (STS) 150 consists of a start point time (STS1) 151, which is the start time of the section, and an end point time (STS2) 152, which is the end time of the section. The story boundary candidate section data (STS) 150 generated by the story boundary candidate section extraction unit 20h is sent to the storage unit 20e and recorded there (FIG. 5).
Next, the story boundary point extraction unit 20j reads the voice pause data (VOP) 100, the topic boundary point data (VORB) 110, the acoustic signal change point data (BGC) 120, the acoustic pause data (BGP) 130, the video signal change point data (VIC) 140, and the story boundary candidate section data (STS) 150 from the storage unit 20e, reads the weight coefficients (W) 162 from the weight coefficient storage unit 20i, and extracts story boundary point data (STB) using these data (step S10). The story boundary point data (STB) extracted by the story boundary point extraction unit 20j is sent to the story boundary point data storage unit 20k and recorded there.
[0031]
Next, details of the process of step S9 in FIG. 7 will be described.
FIG. 8 is a flowchart for explaining details of the process in step S9 in FIG. The details of the process of step S9 performed by the story boundary candidate section extraction unit 20h will be described below with reference to this flowchart.
In this process, a search section (in this example, a section with a time width of 2 seconds starting at time ST1) is moved from 0 to n in 1-second steps. At each step, the scores of the voice pause data (VOP) 100, topic boundary point data (VORB) 110, acoustic signal change point data (BGC) 120, acoustic pause data (BGP) 130, and video signal change point data (VIC) 140 belonging to the search section 16 are weighted with the weight coefficients (W) stored in the weight coefficient storage unit 20i and summed, and a search section whose total score (TSC) reaches a threshold value is extracted as story boundary candidate section data (STS).
[0032]
Specifically, first, ST1, TSC, and q (subscript) are reset to 0 (step S11), and it is determined whether ST1 ≦ VORB1 ≦ ST1 + 2 is satisfied (step S12). That is, it is determined whether a topic boundary point exists in the search section. If ST1 ≦ VORB1 ≦ ST1 + 2 is satisfied, TSC ← TSC + VORB2 * 0.5 is calculated for all topic boundary point data (VORB) satisfying this (step S13), and the process proceeds to step S14. If not, the process proceeds to step S14 without performing this calculation. Note that “0.5” multiplied here is the weighting coefficient (W) assigned to the topic boundary point data (VORB) as described above.
[0033]
Next, it is determined whether or not ST1 ≦ BGC1 ≦ ST1 + 2 is satisfied (step S14). If ST1 ≦ BGC1 ≦ ST1 + 2 is satisfied, TSC ← TSC + BGC2 * 0.2 is calculated for all acoustic signal change point data (BGC) satisfying this (step S15), and the process proceeds to step S16. If it is not satisfied, the process proceeds to step S16 without performing this calculation. Note that “0.2” multiplied here is the weighting coefficient (W) assigned to the acoustic signal change point data (BGC) as described above.
Next, it is determined whether or not ST1 ≦ VIC1 ≦ ST1 + 2 is satisfied (step S16). If ST1 ≦ VIC1 ≦ ST1 + 2 is satisfied, TSC ← TSC + VIC2 * 0.05 is calculated for all video signal change point data (VIC) satisfying this (step S17), and the process proceeds to step S18. If it is not satisfied, the process proceeds to step S18 without performing this calculation. Note that “0.05” multiplied here is the weighting coefficient (W) assigned to the video signal change point data (VIC) as described above.
[0034]
Next, it is determined whether or not ST1 ≦ VOP1 ≦ ST1 + 2 is satisfied (step S18). If ST1 ≦ VOP1 ≦ ST1 + 2 is satisfied, TSC ← TSC + (VOP2 / 3000) * 0.05 is calculated for all voice pause data (VOP) satisfying this (step S19), and the process proceeds to step S20. If it is not satisfied, the process proceeds to step S20 without performing this calculation. Note that “0.05” multiplied here is the weighting coefficient (W) assigned to the voice pause data (VOP) as described above.
Next, it is determined whether or not ST1 ≦ BGP1 ≦ ST1 + 2 is satisfied (step S20). If ST1 ≦ BGP1 ≦ ST1 + 2 is satisfied, TSC ← TSC + (BGP2 / 1000) * 0.05 is calculated for all acoustic pause data (BGP) satisfying this (step S21), and the process proceeds to step S22. If it is not satisfied, the process proceeds to step S22 without performing this calculation. Note that “0.05” multiplied here is the weighting coefficient (W) assigned to the acoustic pause data (BGP) as described above.
[0035]
Then, it is determined whether or not the total score TSC calculated as described above is equal to or greater than the threshold value 0.3 (step S22). If TSC is 0.3 or more, ST1 is substituted into STS1_q and ST1 + 2 into STS2_q (step S23), the result (STS1_q, STS2_q) is stored in the storage unit 20e as story boundary candidate section data (STS) 150 (step S24), and the process proceeds to step S25. The subscript q is updated as q ← q + 1 each time the process of step S24 is performed. On the other hand, if TSC is less than 0.3, the process proceeds directly to step S25 without performing these calculations and recordings.
[0036]
In step S25, it is determined whether or not ST1 + 2 ≧ n (that is, whether or not the end point ST1 + 2 of the search section has reached the content end time n). If ST1 + 2 ≧ n, the process ends. If ST1 + 2 ≧ n is not satisfied, 1 is added to ST1, the total score TSC is reset to 0 (step S26), and the process returns to step S12.
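The flow of steps S11 to S26 can be summarized in a short sketch. The weights, the pause-length divisors (3000 and 1000), the 2-second window, the 1-second step, and the 0.3 threshold are the values given in this example; representing every event as a `(kind, time, raw_score)` tuple and the function names are assumptions made only for illustration.

```python
# Weight coefficients (W) of this example, keyed by event type.
WEIGHTS = {"VORB": 0.5, "BGC": 0.2, "VIC": 0.05, "VOP": 0.05, "BGP": 0.05}
# Pause lengths (msec) are divided by these values before weighting.
PAUSE_NORM = {"VOP": 3000.0, "BGP": 1000.0}


def event_score(kind, raw):
    """Weighted contribution of one event to the total score TSC."""
    return (raw / PAUSE_NORM.get(kind, 1.0)) * WEIGHTS[kind]


def candidate_sections(events, n, width=2, step=1, threshold=0.3):
    """Slide the search section [ST1, ST1+width] and collect (STS1, STS2).

    events: list of (kind, time, raw_score); n: content end time.
    """
    sections = []
    st1 = 0
    while True:
        # Steps S12-S21: sum weighted scores of events inside the section.
        tsc = sum(
            event_score(kind, raw)
            for kind, t, raw in events
            if st1 <= t <= st1 + width
        )
        # Steps S22-S24: keep the section if TSC reaches the threshold.
        if tsc >= threshold:
            sections.append((st1, st1 + width))
        # Steps S25-S26: stop at the end of the content, else advance.
        if st1 + width >= n:
            break
        st1 += step
    return sections


events = [("VORB", 3.0, 0.5), ("BGC", 3.5, 0.4), ("VIC", 8.0, 1.0)]
sections = candidate_sections(events, n=10)
```

In this toy input the topic boundary point alone (0.25) does not reach the threshold, but sections that also contain the acoustic signal change point (0.25 + 0.08) do. Because the step is smaller than the width, adjacent candidate sections can overlap.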
Next, details of the processing in step S10 in FIG. 7 will be described.
FIG. 9 is a flowchart for explaining details of step S10 in FIG. The details of the process of step S10 performed by the story boundary point extraction unit 20j will be described below with reference to this flowchart.
[0037]
In this process, for each story boundary candidate section (the section from STS1 to STS2), the scores (VOP2, VORB2, BGC2, BGP2, VIC2) of the voice pause data (VOP) 100, topic boundary point data (VORB) 110, acoustic signal change point data (BGC) 120, acoustic pause data (BGP) 130, and video signal change point data (VIC) 140 belonging to that section are weighted with the weight coefficients (W) of the weight coefficient storage unit 20i and compared. The time (VOP1, VORB1, BGC1, BGP1, or VIC1) of the voice pause data (VOP) 100, topic boundary point data (VORB) 110, acoustic signal change point data (BGC) 120, acoustic pause data (BGP) 130, or video signal change point data (VIC) 140 whose weighted score is the largest in the story boundary candidate section is set as a story boundary point. Note that the processing illustrated in FIG. 9 is performed for each story boundary candidate section, and the subscript q of STS1 and STS2 is omitted in the following.
[0038]
In this process, first, SC_1 to SC_m and k are reset to 0 (step S31). Here, “SC_1 to SC_m” are variables to which the weighted score corresponding to each event, such as topic boundary point data (VORB) 110, is assigned, k is an integer indicating the subscript of “SC_1 to SC_m”, and m is a sufficiently large integer.
Next, it is determined whether or not STS1 ≦ VORB1 ≦ STS2 is satisfied (step S32). That is, it is determined whether a topic boundary point exists in the story boundary candidate section. If STS1 ≦ VORB1 ≦ STS2 is satisfied, SC_k ← VORB2 * 0.5 is calculated for all topic boundary point data (VORB) satisfying this, and k ← k + 1 is calculated after each calculation (step S33). Note that “0.5” multiplied here is the weighting coefficient (W) assigned as described above. Each calculation result SC_k is stored in the storage unit 20e in association with the corresponding topic boundary point data (VORB), and the process proceeds to step S34. On the other hand, if STS1 ≦ VORB1 ≦ STS2 is not satisfied, the process proceeds to step S34 without performing this calculation and recording.
[0039]
Next, it is determined whether or not STS1 ≦ BGC1 ≦ STS2 is satisfied (step S34). If STS1 ≦ BGC1 ≦ STS2 is satisfied, SC_k ← BGC2 * 0.2 is calculated for all acoustic signal change point data (BGC) satisfying this, and k ← k + 1 is calculated after each calculation (step S35). Note that “0.2” multiplied here is the weighting coefficient (W) assigned as described above. Each calculation result SC_k is stored in the storage unit 20e in association with the corresponding acoustic signal change point data (BGC), and the process proceeds to step S36. On the other hand, if STS1 ≦ BGC1 ≦ STS2 is not satisfied, the process proceeds to step S36 without performing this calculation and recording.
[0040]
Next, it is determined whether or not STS1 ≦ VIC1 ≦ STS2 is satisfied (step S36). If STS1 ≦ VIC1 ≦ STS2 is satisfied, SC_k ← VIC2 * 0.05 is calculated for all video signal change point data (VIC) satisfying this, and k ← k + 1 is calculated after each calculation (step S37). Note that “0.05” multiplied here is the weighting coefficient (W) assigned as described above. Each calculation result SC_k is stored in the storage unit 20e in association with the corresponding video signal change point data (VIC), and the process proceeds to step S38. On the other hand, if STS1 ≦ VIC1 ≦ STS2 is not satisfied, the process proceeds to step S38 without performing this calculation and recording.
[0041]
Next, it is determined whether or not STS1 ≦ VOP1 ≦ STS2 is satisfied (step S38). If STS1 ≦ VOP1 ≦ STS2 is satisfied, SC_k ← (VOP2 / 3000) * 0.05 is calculated for all voice pause data (VOP) satisfying this, and k ← k + 1 is calculated after each calculation (step S39). Note that “0.05” multiplied here is the weighting coefficient (W) assigned as described above. Each calculation result SC_k is stored in the storage unit 20e in association with the corresponding voice pause data (VOP), and the process proceeds to step S40. On the other hand, if STS1 ≦ VOP1 ≦ STS2 is not satisfied, the process proceeds to step S40 without performing this calculation and recording.
[0042]
Next, it is determined whether or not STS1 ≦ BGP1 ≦ STS2 is satisfied (step S40). If STS1 ≦ BGP1 ≦ STS2 is satisfied, SC_k ← (BGP2 / 1000) * 0.05 is calculated for all acoustic pause data (BGP) satisfying this, and k ← k + 1 is calculated after each calculation (step S41). Note that “0.05” multiplied here is the weighting coefficient (W) assigned as described above. Each calculation result SC_k is stored in the storage unit 20e in association with the corresponding acoustic pause data (BGP), and the process proceeds to step S42. On the other hand, if STS1 ≦ BGP1 ≦ STS2 is not satisfied, the process proceeds to step S42 without performing this calculation and recording.
[0043]
Then, the scores SC_p (0 ≦ p ≦ k) stored in the storage unit 20e are read, and the maximum value among the read SC_p is detected by a sorting algorithm or the like. The time (VORB1, BGC1, VIC1, VOP1, or BGP1) of the event corresponding to (associated with) this maximum SC_p is set as story boundary point data (STB), and the value is recorded in the story boundary point data storage unit 20k (step S42).
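Steps S31 to S42 amount to picking, within one candidate section, the event with the largest weighted score. A minimal sketch follows, using the weights and pause-length divisors of this example; the `(kind, time, raw_score)` event tuples and the function names are assumptions for illustration, and ties are resolved here in favor of the first event encountered, which the text does not specify.

```python
# Weight coefficients (W) and pause-length divisors of this example.
WEIGHTS = {"VORB": 0.5, "BGC": 0.2, "VIC": 0.05, "VOP": 0.05, "BGP": 0.05}
PAUSE_NORM = {"VOP": 3000.0, "BGP": 1000.0}


def weighted_score(kind, raw):
    """SC_k for one event: raw score (or normalized pause length) times W."""
    return (raw / PAUSE_NORM.get(kind, 1.0)) * WEIGHTS[kind]


def story_boundary_point(events, sts1, sts2):
    """Return the time of the highest-scoring event in [sts1, sts2], or None."""
    in_section = [
        (weighted_score(kind, raw), t)
        for kind, t, raw in events
        if sts1 <= t <= sts2
    ]
    if not in_section:
        return None
    # max() keeps the first of several equal scores (an arbitrary choice here).
    return max(in_section, key=lambda pair: pair[0])[1]


events = [
    ("VOP", 2.5, 1500),   # weighted score 0.025
    ("VORB", 3.0, 0.5),   # weighted score 0.25
    ("BGC", 3.5, 0.4),    # weighted score about 0.08
    ("VIC", 8.0, 1.0),    # weighted score 0.05
]
stb = story_boundary_point(events, 2, 4)
```

For the candidate section from 2 to 4, the topic boundary point at time 3.0 has the largest weighted score and becomes the story boundary point.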
Next, a usage example of the story boundary point data (STB) extracted in this way will be described.
FIG. 10 is a block diagram illustrating a functional configuration of the data analysis apparatus 20 when using story boundary point data (STB).
[0044]
In this example, it is assumed that the content data storage unit 20a records voice recognition result data (VOR) generated by the voice recognition unit 20c (FIG. 4) in addition to the data described above.
In this example, first, the control unit 20n accesses the content data storage unit 20a, the storage unit 20e, and the story boundary point data storage unit 20k. It reads the content data and the voice recognition result data (VOR) from the content data storage unit 20a, the voice pause data (VOP) and the acoustic pause data (BGP) from the storage unit 20e, and the story boundary point data (STB) from the story boundary point data storage unit 20k, and uses this information to generate the story structure display screen 200 illustrated in FIG. 11. The generated story structure display screen 200 is sent to the display unit 20m, which displays it.
[0045]
As illustrated in FIG. 11, the story structure display screen 200 includes a plurality of story sections 210 to 230 obtained by dividing the content at the story boundary points. Each of the story sections 210 to 230 consists of a story range display section 211 that displays the start time (“start”) and “length” of the story; icons 212 that accept click operations for “story detail display”, “metadata display”, “story playback”, and “subtitle playback”; a summary display section 213 that displays a summary of the story (“summary”) and its keywords (“topic”); and pause boundary screens 214a to 214h consisting of the first still image of the content and the first still images after pauses.
[0046]
In this example, the story sections 210 to 230 are determined by dividing the content at the story boundary points using the story boundary point data (STB). The "start" displayed in the story range display section 211 is generated from the time indicated by the story boundary point data (STB) (the start time of each story section 210 to 230), and the "length" is generated from the difference between the start time and the end time of each story section 210 to 230 indicated by the story boundary point data (STB). The "summary" and "topic" of the summary display section 213 are generated by processing, for each story section 210 to 230, the speech recognition result data (VOR) corresponding to that section. The pause boundary screens 214a to 214h are generated by arranging, in chronological order, the still images of the content at the time when the pause length (VOP2) 102 has elapsed from the time (VOP1) 101 of the voice pause data (VOP) and the still images of the content at the time when the pause length (BGP2) 132 has elapsed from the time (BGP1) 131 of the acoustic pause data (BGP).
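The derivation of "start" and "length" from the story boundary point data, and of the pause boundary still-image times, can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the names `StorySection`, `build_story_sections`, and `pause_frame_times` are hypothetical, all times are assumed to share one unit (seconds), and the first section is assumed to start at the head of the content.

```python
from dataclasses import dataclass


@dataclass
class StorySection:
    start: float   # start time of the section (assumed unit: seconds)
    length: float  # end time minus start time


def build_story_sections(stb_times, content_duration):
    """Divide [0, content_duration] at the story boundary points (STB)."""
    # Assumption: boundaries outside the content are ignored; duplicates are merged.
    cuts = sorted({t for t in stb_times if 0.0 < t < content_duration})
    edges = [0.0] + cuts + [content_duration]
    return [StorySection(start=a, length=b - a) for a, b in zip(edges, edges[1:])]


def pause_frame_times(vop, bgp):
    """Times of the still images after each pause: pause time plus pause length.

    vop and bgp are lists of (time, pause_length) pairs, standing in for
    (VOP1, VOP2) and (BGP1, BGP2); the result is in chronological order.
    """
    return sorted(t + length for t, length in list(vop) + list(bgp))
```

For example, boundaries at 45 and 120 in a 200-second item yield three sections starting at 0, 45, and 120 with lengths 45, 75, and 80.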
[0047]
A user browsing the story structure display screen 200 displayed in this way clicks, by mouse input or the like from the input unit 20p, an icon 212 of the story section 210 to 230 that the user wishes to view. This input information is transmitted to the control unit 20n, and when the control unit 20n detects the click operation on the icon 212, it performs the corresponding processing, for example, reading the content of the selected story section from the content data storage unit 20a and sending it to the display unit 20m for display.
In addition, the control unit 20n may search the character string indicated by the speech recognition result data (VOR) using a keyword that the user enters at the input unit 20p and display the story sections containing that keyword on the display unit 20m. When the user then indicates at the input unit 20p that he or she wishes to view the content of a displayed story section, the content of that story section may be displayed on the display unit 20m.
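One possible way to realize such a keyword search is sketched below. It assumes, purely for illustration, that the speech recognition result data (VOR) is available as a list of (time, text) pairs and that the story boundary points are given as a list of times; the function name `sections_containing` and this data layout are hypothetical and are not taken from the specification.

```python
from bisect import bisect_right


def sections_containing(keyword, vor, stb_times):
    """Return the indices of story sections whose recognized text contains keyword.

    Section 0 precedes the first story boundary point; section i starts at
    the i-th boundary (in ascending order of time).
    """
    cuts = sorted(stb_times)
    hits = set()
    for time, text in vor:
        if keyword in text:
            # Number of boundaries at or before `time` equals the section index.
            hits.add(bisect_right(cuts, time))
    return sorted(hits)
```

With recognized text "weather" at time 5 and "weather forecast" at time 130, and boundaries at 45 and 120, the keyword "weather" maps to sections 0 and 2.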
[0048]
As described above, in this embodiment, the character string specified by the content data is extracted by speech recognition processing (step S3); the extracted character string is divided into semantically coherent units and topic boundary point data (VORB) is extracted (step S4); signal change point data (VOP, BGC, BGP, VIC) specifying the time-series positions (signal change points) of the physical change points of the content specified by the content data is extracted (steps S2, S6, S8); and story boundary point data (STB) is extracted using the topic boundary point data (VORB) and the signal change point data (VOP, BGC, BGP, VIC). For this reason, even in a section where no speech recognition result is obtained from the content, the story structure can easily be extracted by using the other signal change point data (VOP, BGC, BGP, VIC).
[0049]
In this embodiment, story boundary points are extracted using not only the topic boundary point data (VORB) but also the signal change point data (VOP, BGC, BGP, VIC), so the accuracy and precision of story boundary point extraction are improved.
Furthermore, in this embodiment, the story boundary candidate section extraction unit 20h (which constitutes the story boundary extraction means) sets a search section having a time-series width as a story boundary candidate section when the scores of the topic boundary points and the scores of the signal change points belonging to that search section satisfy a predetermined condition (step S9), and the story boundary point extraction unit 20j (which also constitutes the story boundary extraction means) sets, as the story boundary point, a topic boundary point or signal change point extracted for each story boundary candidate section from those belonging to it (step S10). As a result, even when a plurality of topic boundary points and signal change points correspond to a single story boundary point of the content and these points do not coincide exactly (because of errors inherent in the detection methods, for example), it is possible to prevent all of the detected topic boundary points and signal change points from being judged to be story boundary points. Story boundary points can thus be extracted accurately and realistically from a plurality of topic boundary points and signal change points.
[0050]
Further, in this embodiment, when the sum of the topic boundary point score (VORB2) and the signal change point scores (VOP2, BGC2, BGP2, VIC2), each weighted by the weighting coefficient (W), reaches a predetermined threshold value, the search section to which those topic boundary points or signal change points belong is set as a story boundary candidate section (steps S11 to S26). As a result, the detection reliability of the topic boundary points and signal change points is reflected in the detection of story boundary candidate sections, and the degree to which each kind of point contributes to that detection can be set freely by adjusting the weighting coefficient (W). Appropriate story boundary candidate sections can therefore be detected.
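The thresholding idea can be sketched as follows. This is a simplified illustration rather than a reproduction of steps S11 to S26: the function name `find_candidate_sections`, the fixed-step sliding search section, the weight values, and the assumption that all scores have already been normalized to a comparable range are hypothetical choices made only for this example.

```python
def find_candidate_sections(events, weights, window, step, threshold, duration):
    """Return search sections whose weighted score sum reaches the threshold.

    events  : list of (kind, time, score) tuples, e.g. ("VORB", 12.0, 0.9)
    weights : dict mapping each kind to its weighting coefficient (W)
    """
    candidates = []
    start = 0.0
    while start < duration:
        end = start + window
        # Bounds are inclusive, mirroring conditions of the form STS1 <= t <= STS2.
        total = sum(weights[kind] * score
                    for kind, time, score in events
                    if start <= time <= end)
        if total >= threshold:
            candidates.append((start, end))  # plays the role of (STS1, STS2)
        start += step
    return candidates
```

Raising the weight of one kind of event makes search sections containing that kind more likely to reach the threshold, which is the adjustment the paragraph above describes.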
[0051]
In addition, in this embodiment, for each story boundary candidate section, the scores (VORB2, VOP2, BGC2, BGP2, VIC2) of the topic boundary points and signal change points belonging to that section are weighted with the weighting coefficient (W) and compared, and the topic boundary point or signal change point whose weighted score (SC_k) is the largest in the story boundary candidate section is set as the story boundary point (steps S31 to S42). As a result, the story boundary point can be determined using the most reliable data in the story boundary candidate section, and story boundary points can be extracted appropriately.
Furthermore, in this embodiment, the topic boundary point score (VORB2) is generated using the score obtained during speech recognition processing, so the reliability of the speech recognition processing can be reflected in the story boundary point extraction process. As a result, the influence of speech recognition errors can be reduced, and appropriate story boundary points can be extracted.
[0052]
In addition, in this embodiment, the topic boundary point score (VORB2) is generated using the score obtained when the topic boundary point data is extracted, so the reliability of the topic boundary point extraction process can be reflected in the story boundary point extraction process. As a result, the influence of topic boundary point extraction errors can be reduced, and appropriate story boundary points can be extracted.
Further, in this embodiment, the signal change point scores (VOP2, BGC2, BGP2, VIC2) are generated using the scores obtained when the signal change point data is extracted, so the reliability of the signal change point extraction process can be reflected in the story boundary point extraction process. As a result, the influence of signal change point extraction errors can be reduced, and appropriate story boundary points can be extracted.
[0053]
Next, the second embodiment of the invention will be described.
This embodiment is a modification of the first embodiment. It differs from the first embodiment in that the weighting coefficient used when extracting story boundary candidate section data (step S9) (the first weighting coefficient) and the weighting coefficient used when extracting story boundary point data (step S10) (the second weighting coefficient) are different coefficients. Hereinafter, only the points that differ from the first embodiment are described, and matters common to the first embodiment are omitted. Where this embodiment has the same functional configuration as the first embodiment, the same reference numerals are used.
[0054]
FIG. 12 is a block diagram illustrating the processing functions of the data analysis apparatus 300 of this embodiment, which are constructed by executing the data analysis program stored in the data analysis program memory 26 in the hardware configuration illustrated in FIG. 3.
The configuration of this embodiment differs from that of the first embodiment in that, instead of the weighting coefficient storage unit 20i of the first embodiment, it is provided with a boundary candidate section weighting coefficient storage unit 301 that stores the weighting coefficient (W) used when extracting story boundary candidate section data (step S9) and a boundary point weighting coefficient storage unit 302 that stores the weighting coefficient (W′) used when extracting story boundary point data (step S10). The boundary candidate section weighting coefficient storage unit 301 is configured to be able to provide the weighting coefficient (W) to the story boundary candidate section extraction unit 20h, and the boundary point weighting coefficient storage unit 302 is configured to be able to provide the weighting coefficient (W′) to the story boundary point extraction unit 20j.
[0055]
FIG. 13 illustrates the configuration of data stored in the boundary candidate section weight coefficient storage unit 301.
As illustrated in FIG. 13, the weighting coefficient (W′) 312 is a coefficient assigned to each event 311 such as topic boundary point data (VORB). In this example, "0.2" is assigned to "topic boundary point data (VORB)", "0.3" to "acoustic signal change point data (BGC)", "0.3" to "video signal change point data (VIC)", "0.1" to "voice pause data (VOP)", and "0.1" to "acoustic pause data (BGP)".
[0056]
Note that the weighting coefficient (W) stored in the boundary candidate section weighting coefficient storage unit 301 is the same as that illustrated in FIG. 6, and the description thereof is omitted here.
Next, the data analysis method performed by the data analysis apparatus 300 in this embodiment will be described. The processing in steps S1 to S9 described in the first embodiment (FIG. 7) is performed in the same manner in the data analysis method of the second embodiment, the only difference being that the weighting coefficient storage unit 20i (FIG. 4) is replaced by the boundary candidate section weighting coefficient storage unit 301 (FIG. 12). Its description is therefore omitted here, and only the processing in step S10 of FIG. 7, which differs from the first embodiment, is described.
[0057]
FIG. 14 is a flowchart for explaining details of step S10 of FIG. 7 in the second embodiment. The details of the processing in step S10 in this form performed by the story boundary point extraction unit 20j (FIG. 4) will be described below with reference to this flowchart.
In this process, for each story boundary candidate section (the section from STS1 to STS2), the scores (VOP2, VORB2, BGC2, BGP2, VIC2) of the events, such as the voice pause data (VOP) 100, belonging to that section are weighted with the weighting coefficient (W′) from the boundary point weighting coefficient storage unit 302 and compared, and the time (VOP1, VORB1, BGC1, BGP1, or VIC1) of the event whose weighted score is the largest in the story boundary candidate section is set as the story boundary point. The process illustrated in FIG. 14 is performed for each story boundary candidate section, and the subscript q of STS1 and STS2 is omitted below.
[0058]
In this case, first, SC_1 to SC_m and k are reset to 0 (step S51), and it is determined whether or not STS1 ≦ VORB1 ≦ STS2 is satisfied (step S52). When STS1 ≦ VORB1 ≦ STS2 is satisfied, SC_k ← VORB2 * 0.2 is calculated for every item of topic boundary point data (VORB) that satisfies it, and k ← k + 1 is performed after each calculation (step S53). The "0.2" multiplied here is the weighting coefficient (W′) assigned as described above. Each calculation result SC_k is stored in the storage unit 20e in association with the corresponding topic boundary point data (VORB), and the process proceeds to step S54. On the other hand, if STS1 ≦ VORB1 ≦ STS2 is not satisfied, the process proceeds to step S54 without performing this calculation and storage.
[0059]
Next, it is determined whether or not STS1 ≦ BGC1 ≦ STS2 is satisfied (step S54). When STS1 ≦ BGC1 ≦ STS2 is satisfied, SC_k ← BGC2 * 0.3 is calculated for every item of acoustic signal change point data (BGC) that satisfies it, and k ← k + 1 is performed after each calculation (step S55). The "0.3" multiplied here is the weighting coefficient (W′) assigned as described above. Each calculation result SC_k is stored in the storage unit 20e in association with the corresponding acoustic signal change point data (BGC), and the process proceeds to step S56. On the other hand, if STS1 ≦ BGC1 ≦ STS2 is not satisfied, the process proceeds to step S56 without performing this calculation and storage.
[0060]
Next, it is determined whether or not STS1 ≦ VIC1 ≦ STS2 is satisfied (step S56). When STS1 ≦ VIC1 ≦ STS2 is satisfied, SC_k ← VIC2 * 0.3 is calculated for every item of video signal change point data (VIC) that satisfies it, and k ← k + 1 is performed after each calculation (step S57). The "0.3" multiplied here is the weighting coefficient (W′) assigned as described above. Each calculation result SC_k is stored in the storage unit 20e in association with the corresponding video signal change point data (VIC), and the process proceeds to step S58. On the other hand, if STS1 ≦ VIC1 ≦ STS2 is not satisfied, the process proceeds to step S58 without performing this calculation and storage.
[0061]
Next, it is determined whether or not STS1 ≦ VOP1 ≦ STS2 is satisfied (step S58). When STS1 ≦ VOP1 ≦ STS2 is satisfied, SC_k ← (VOP2 / 3000) * 0.1 is calculated for every item of voice pause data (VOP) that satisfies it, and k ← k + 1 is performed after each calculation (step S59). The "0.1" multiplied here is the weighting coefficient (W′) assigned as described above. Each calculation result SC_k is stored in the storage unit 20e in association with the corresponding voice pause data (VOP), and the process proceeds to step S60. On the other hand, if STS1 ≦ VOP1 ≦ STS2 is not satisfied, the process proceeds to step S60 without performing this calculation and storage.
[0062]
Next, it is determined whether or not STS1 ≦ BGP1 ≦ STS2 is satisfied (step S60). When STS1 ≦ BGP1 ≦ STS2 is satisfied, SC_k ← (BGP2 / 1000) * 0.1 is calculated for every item of acoustic pause data (BGP) that satisfies it, and k ← k + 1 is performed after each calculation (step S61). The "0.1" multiplied here is the weighting coefficient (W′) assigned as described above. Each calculation result SC_k is stored in the storage unit 20e in association with the corresponding acoustic pause data (BGP), and the process proceeds to step S62. On the other hand, if STS1 ≦ BGP1 ≦ STS2 is not satisfied, the process proceeds to step S62 without performing this calculation and storage.
[0063]
Then, each score SC_p (0 ≦ p ≦ k) stored in the storage unit 20e is read, and the maximum value among the read scores SC_p is detected by a sorting algorithm or the like. The event time (VORB1, BGC1, VIC1, VOP1, or BGP1) associated with the largest SC_p is set as the story boundary point data (STB), and that value is recorded in the story boundary point data storage unit 20k (step S62).
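The flow of steps S51 to S62 can be condensed into a short sketch such as the following. It is illustrative only: the function name `select_story_boundary` and the (kind, time, score) event layout are hypothetical, the weights are the W′ values given for FIG. 13 (with 0.2 for VORB as used in step S53), and the divisors 3000 and 1000 are those appearing in steps S59 and S61. A single pass that tracks the running maximum stands in for the "sorting algorithm or the like".

```python
# Weighting coefficients (W') per event kind, as described for FIG. 13.
W_PRIME = {"VORB": 0.2, "BGC": 0.3, "VIC": 0.3, "VOP": 0.1, "BGP": 0.1}
# Divisors applied to the pause lengths in steps S59 and S61.
NORMALIZER = {"VOP": 3000.0, "BGP": 1000.0}


def select_story_boundary(events, sts1, sts2, weights=W_PRIME):
    """Return the time of the event with the largest weighted score in [sts1, sts2].

    events: list of (kind, time, score) tuples, e.g. ("VIC", 102.0, 0.7).
    Returns None when no event falls inside the candidate section.
    """
    best_time, best_score = None, float("-inf")
    for kind, time, score in events:
        if not (sts1 <= time <= sts2):
            continue  # event lies outside this story boundary candidate section
        sc = (score / NORMALIZER.get(kind, 1.0)) * weights[kind]
        if sc > best_score:
            best_time, best_score = time, sc
    return best_time
```

Passing a different `weights` table for this step than the one used for candidate section detection corresponds to the separation of W and W′ in this embodiment.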
Thus, in this embodiment, the weighting coefficient (W) used when extracting story boundary candidate section data (step S9) (the first weighting coefficient) and the weighting coefficient (W′) used when extracting story boundary point data (step S10) (the second weighting coefficient) are set separately. These weighting coefficients can therefore be given different values, which allows the contribution of topic boundary points and signal change points to story boundary candidate section extraction and to story boundary point extraction to be adjusted more flexibly. For example, story boundary candidate sections can be extracted with emphasis on topic boundary points, while story boundary points are extracted with emphasis on video signal change points. As a result, story boundaries can be extracted more appropriately.
[0064]
Next, the third embodiment of the invention will be described.
This embodiment is a modification of the first embodiment. It differs from the first embodiment in that, in the story boundary point data extraction (step S10), story boundary points are extracted from the story boundary candidate sections not by comparing scores weighted by weighting coefficients but according to a priority assigned to each event itself. Hereinafter, only the points that differ from the first embodiment are described, and matters common to the first embodiment are omitted. Where this embodiment has the same functional configuration as the first embodiment, the same reference numerals are used.
[0065]
FIG. 15 is a block diagram illustrating the processing functions of the data analysis apparatus 400 of this embodiment, which are constructed by executing the data analysis program stored in the data analysis program memory 26 in the hardware configuration illustrated in FIG. 3.
The configuration of this embodiment differs from that of the first embodiment in that the weighting coefficient storage unit 20i of the first embodiment does not supply the weighting coefficient (W) to the story boundary point extraction unit 20j, and in that an event order storage unit 401 that stores an event rank (R) indicating the priority of each event is provided. The event order storage unit 401 is configured to be able to provide the event rank (R) to the story boundary point extraction unit 20j.
[0066]
FIG. 16 illustrates the configuration of data stored in the event order storage unit 401.
As illustrated in FIG. 16, the event order storage unit 401 stores event ranks (R) associated with events 411 such as video signal change point data (VIC). In this example, event rank "1" is assigned to the video signal change point data (VIC), event rank "2" to the acoustic signal change point data (BGC), and event rank "3" to the topic boundary point data (VORB). As described above, the event rank indicates the priority of each event: when multiple events exist in the same story boundary candidate section, the time (VORB1, BGC1, or VIC1) of the event with the highest event rank is selected as the story boundary.
[0067]
Next, the data analysis method performed by the data analysis apparatus 400 in this embodiment will be described. The processing in steps S1 to S9 described in the first embodiment (FIG. 7) is performed in the same manner in the data analysis method of the third embodiment, so its description is omitted here, and only the processing in step S10 of FIG. 7, which differs from the first embodiment, is described.
FIG. 17 is a flowchart for explaining details of step S10 of FIG. 7 in the third embodiment. The details of the processing in step S10 in this form performed by the story boundary point extraction unit 20j (FIG. 4) will be described below with reference to this flowchart.
[0068]
In this process, for each story boundary candidate section (the section from STS1 to STS2), the event with the highest priority is selected from the events, such as video signal change point data (VIC), belonging to that section, and its time (VIC1, BGC1, or VORB1) is set as the story boundary point. The process illustrated in FIG. 17 is performed for each story boundary candidate section, and the subscript q of STS1 and STS2 is omitted below.
In this case, first, SC_1 to SC_m are reset to 0 (step S71). The event rank (R) 412 is then acquired from the event order storage unit 401, and it is determined whether or not the time (VIC1) of the event whose event rank (R) 412 is "1", that is, the video signal change point data (VIC), satisfies STS1 ≦ VIC1 ≦ STS2 (step S72). When STS1 ≦ VIC1 ≦ STS2 is satisfied, the time (VIC1) of the video signal change point data (VIC) having the largest change score (VIC2) among the video signal change point data (VIC) satisfying it is set as the story boundary point data (STB) (step S73), and the process ends.
[0069]
On the other hand, when STS1 ≦ VIC1 ≦ STS2 is not satisfied, the event rank (R) 412 is acquired from the event order storage unit 401, and it is determined whether or not the time (BGC1) of the event whose event rank (R) 412 is "2", that is, the acoustic signal change point data (BGC), satisfies STS1 ≦ BGC1 ≦ STS2 (step S74). When STS1 ≦ BGC1 ≦ STS2 is satisfied, the time (BGC1) of the acoustic signal change point data (BGC) having the largest change score (BGC2) among the acoustic signal change point data (BGC) satisfying it is set as the story boundary point data (STB) (step S75), and the process ends.
On the other hand, when STS1 ≦ BGC1 ≦ STS2 is not satisfied, the event rank (R) 412 is acquired from the event order storage unit 401, and it is determined whether or not the time (VORB1) of the event whose event rank (R) 412 is "3", that is, the topic boundary point data (VORB), satisfies STS1 ≦ VORB1 ≦ STS2 (step S76). When STS1 ≦ VORB1 ≦ STS2 is satisfied, the time (VORB1) of the topic boundary point data (VORB) having the largest boundary score (VORB2) among the topic boundary point data (VORB) satisfying it is set as the story boundary point data (STB) (step S77), and the process ends.
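The priority-based selection of steps S72 to S77 can be sketched as follows. This is a minimal illustration under assumptions: the function name `select_by_priority` and the (kind, time, score) event layout are hypothetical, and the rank table mirrors the example values given for FIG. 16. When two events of the same kind have equal scores, this sketch happens to pick the later one; the specification does not state a tie-breaking rule.

```python
# Event ranks (R): a smaller number means a higher priority.
EVENT_RANK = {"VIC": 1, "BGC": 2, "VORB": 3}


def select_by_priority(events, sts1, sts2, rank=EVENT_RANK):
    """Return the story boundary time for the candidate section [sts1, sts2].

    Event kinds are tried in priority order; for the first kind that has an
    event inside the section, the time of its highest-scoring event is used.
    Returns None when no ranked event falls inside the section.
    """
    for kind in sorted(rank, key=rank.get):
        in_section = [(score, time) for k, time, score in events
                      if k == kind and sts1 <= time <= sts2]
        if in_section:
            return max(in_section)[1]  # time of the maximum-score event
    return None
```

Unlike the weighted comparison, a low-scoring video signal change point here still wins over a high-scoring topic boundary point in the same section, because the kind of event alone decides the priority.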
[0070]
As described above, in this embodiment, the priority of each event is determined by the event rank (R) 412, the event with the highest priority is selected from the events belonging to the story boundary candidate section, and the time of that event (VIC1, BGC1, or VORB1) is set as the story boundary. It therefore becomes possible to preferentially select a more appropriate event as the story boundary point, and the story boundary point extraction process can be optimized.
The present invention is not limited to the first to third embodiments described above. For example, in these embodiments, speech recognition processing is used as the pattern recognition processing to extract character strings from the content. However, character recognition processing may instead be used as the pattern recognition processing to extract character strings from text, such as telops, appearing in the content video, and those character strings may be used to extract topic boundary points.
[0071]
Also, the time point at which this telop appears may be extracted as a video signal change point, and used as a signal change point to extract a story boundary point.
Furthermore, in these embodiments, the data analysis apparatus is configured by executing a predetermined program on a computer; however, at least part of the processing may instead be realized in hardware using electronic circuits.
Further, as described in the first to third embodiments, the processing function of the data analysis apparatus can be realized by a computer. In this case, the processing contents of the functions that the data analysis apparatus should have are described by a program, and the processing functions can be realized on the computer by executing the program on the computer.
[0072]
The program describing the processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be any medium, such as a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable) / RW (ReWritable), or the like as the optical disk; and an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium.
[0073]
The program is distributed by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM in which the program is recorded. Furthermore, the program may be distributed by storing the program in a storage device of the server computer and transferring the program from the server computer to another computer via a network.
A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or the program transferred from a server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another execution form, the computer may read the program directly from the portable recording medium and execute processing according to it, or may sequentially execute processing according to the received program each time a program is transferred from the server computer to the computer. Alternatively, the program need not be transferred from the server computer to the computer at all; the above-described processing may be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions solely through execution instructions and result acquisition.
[0074]
Note that the term "program" as used above means instructions to a computer that are combined so that a single result can be obtained, and also includes other information that is provided for processing by a computer and is equivalent to a program.
[0075]
[Effect of the invention]
As described above, in the present invention, the character string specified by the content data is extracted by pattern recognition processing, the extracted character string is divided into semantically coherent units, and topic boundary point data specifying the time-series positions of the division points (topic boundary points) is extracted. In addition, signal change point data specifying the time-series positions (signal change points) of the physical change points of the content specified by the content data is extracted. Then, using the extracted topic boundary point data and signal change point data, the time-series positions (story boundary points) of the content-wise boundary points of the content are specified. For this reason, the story structure can easily be extracted even in a section where no speech recognition result can be obtained from the content.
[Brief description of the drawings]
FIG. 1 is a conceptual diagram illustrating an outline of a data analysis apparatus.
FIG. 2 is a conceptual diagram for explaining a data analysis method.
FIG. 3 is a block diagram illustrating a hardware configuration of the data analysis apparatus.
FIG. 4 is a block diagram illustrating a processing function of the data analysis apparatus.
FIG. 5 is a diagram illustrating a configuration of data stored in a storage unit.
FIG. 6 is a diagram illustrating a configuration of data stored in a weight coefficient storage unit.
FIG. 7 is a flowchart for explaining a data analysis method performed by the data analysis apparatus.
FIG. 8 is a flowchart for explaining details of processing in step S9 in FIG. 7.
FIG. 9 is a flowchart for explaining details of step S10 in FIG. 7.
FIG. 10 is a block diagram illustrating a functional configuration of a data analysis apparatus when using story boundary point data.
FIG. 11 is a diagram illustrating a story structure display screen.
FIG. 12 is a block diagram illustrating a processing function of the data analysis apparatus.
FIG. 13 is a diagram illustrating a configuration of data stored in a boundary candidate section weight coefficient storage unit.
FIG. 14 is a flowchart for explaining details of step S10 in FIG. 7.
FIG. 15 is a block diagram illustrating a processing function of the data analysis apparatus.
FIG. 16 is a diagram illustrating a configuration of data stored in an event order storage unit.
FIG. 17 is a flowchart for explaining details of step S10 in FIG. 7.
[Explanation of symbols]
1, 20, 300, 400 Data analysis apparatus
3 Character string extraction means
4 Topic boundary point extraction means
5 Signal change point extraction means
6 Story boundary point extraction means

Claims

In a data analysis device that analyzes the story structure of content data,
A character string extracting means for extracting a character string specified by the content data by a pattern recognition process;
Topic boundary point extraction means for dividing the character string extracted by the character string extraction means into semantically coherent units and extracting topic boundary point data for specifying the time-series positions of the division points (hereinafter referred to as "topic boundary points");
Signal change point extraction means for extracting signal change point data for specifying a time-series position (hereinafter referred to as “signal change point”) of a physical change point of the content specified by the content data;
Story boundary point extraction means for extracting, using the topic boundary point data extracted by the topic boundary point extraction means and the signal change point data extracted by the signal change point extraction means, story boundary point data for specifying the time-series positions of the content-wise boundary points of the content (hereinafter referred to as "story boundary points"),
A data analysis apparatus comprising:

The story boundary extraction means includes:
A means that, when the score of the topic boundary point belonging to a search section having a time-series width and the score of the signal change point belonging to the search section satisfy a predetermined condition, sets the search section as a story boundary candidate section, and
sets, as the story boundary point, the topic boundary point or the signal change point that is extracted for each story boundary candidate section and belongs to that story boundary candidate section,
The data analysis apparatus according to claim 1.

The story boundary extraction means includes:
A means that, when the sum of the score of the topic boundary point and the score of the signal change point, each weighted with a first weighting coefficient, reaches a predetermined threshold value, sets the search section to which the topic boundary point or the signal change point belongs as the story boundary candidate section,
The data analysis apparatus according to claim 2.

The story boundary extraction means includes:
A means that, for each story boundary candidate section, weights the scores of the topic boundary point and the signal change point belonging to the story boundary candidate section with a second weighting coefficient and compares them, and sets, as the story boundary point, the topic boundary point or the signal change point whose weighted score is the largest in the story boundary candidate section,
The data analysis apparatus according to claim 2 or 3, characterized by the above.

The data analysis apparatus according to claim 4, wherein the second weighting coefficient is a coefficient different from the first weighting coefficient.

The story boundary extraction means includes:
A means that sets, as the story boundary point, the topic boundary point or the signal change point having the highest priority among the topic boundary points and the signal change points belonging to the story boundary candidate section,
The data analysis apparatus according to claim 2 or 3, characterized by the above.

The topic boundary score is:
A score generated using the score at the time of the pattern recognition process,
A data analysis apparatus according to any one of claims 2 to 6.

The topic boundary score is:
A score generated using a score when extracting the topic boundary point data;
A data analysis apparatus according to any one of claims 2 to 6.

The score of the signal change point is
A score generated using a score when extracting the signal change point data;
A data analysis apparatus according to any one of claims 2 to 6.

A data analysis program for causing a computer to function as the data analysis apparatus according to claim 1.