JP2012022053A

JP2012022053A - Voice recognition device

Info

Publication number: JP2012022053A
Application number: JP2010158219A
Authority: JP
Inventors: Jun Ohashi; 純大橋; Takeshi Nagai; 剛永井
Original assignee: Fujitsu Toshiba Mobile Communication Ltd
Current assignee: Fujitsu Mobile Communications Ltd
Priority date: 2010-07-12
Filing date: 2010-07-12
Publication date: 2012-02-02

Abstract

PROBLEM TO BE SOLVED: To automatically control voice recognition parameters for content. SOLUTION: A voice recognition device includes a content acquisition section 107 which acquires content including voice data and a web page acquisition section 102 which acquires a web page for providing content. The voice recognition device includes a web page analysis section 103 which performs analysis based on the web page for providing content and extracts text indicating characteristics of the voice data. The voice recognition device includes a parameter control section 106 which controls voice recognition parameters for the voice data based on the extracted text and a voice recognition section 111 which performs voice recognition of the voice data in accordance with the controlled voice recognition parameters.

Description

Embodiments of the present invention relate to speech recognition.

Users can access a wide variety of content via broadcast waves, recording media, or networks (for example, video sharing sites). Content playback devices have also diversified: not only TV receivers but also mobile phones, personal computers, video game consoles, and the like may provide a content playback function.

It has been proposed to perform speech recognition on the audio data included in content and to use the recognition result as, for example, subtitles. Speech recognition is realized using speech recognition parameters such as an acoustic model, a language model, and a word dictionary. To obtain a highly accurate recognition result, it is important to control the speech recognition parameters appropriately for the audio data to be recognized. For example, for speech recognition of broadcast programs (mainly news programs), the speech recognition parameters are controlled manually (for example, by training the acoustic model and the language model).

JP 2004-333738 A

Manually controlling the speech recognition parameters for each piece of content is inconvenient. On the other hand, if the speech recognition parameters are fixed, highly accurate speech recognition across diverse content becomes difficult.

Accordingly, an object of embodiments of the present invention is to automatically control speech recognition parameters for content.

A speech recognition apparatus according to one aspect includes a content acquisition unit that acquires content including audio data, and a Web page acquisition unit that acquires a Web page providing the content. The apparatus further includes a Web page analysis unit that performs analysis based on the Web page providing the content and extracts text indicating features of the audio data, and a parameter control unit that controls speech recognition parameters for the audio data based on the extracted text. The apparatus also includes a speech recognition unit that performs speech recognition on the audio data in accordance with the controlled speech recognition parameters.

A speech recognition apparatus according to another aspect includes a content acquisition unit that acquires content including audio data. The apparatus includes a Web page acquisition unit that acquires a Web page related to the content based on at least one of a speech recognition result of the audio data, an image recognition result of video data separated from the content, and text data separated from the content. The apparatus further includes a Web page analysis unit that performs analysis based on the Web page related to the content and extracts text indicating features of the audio data, and a parameter control unit that controls speech recognition parameters for the audio data based on the extracted text. The apparatus also includes a speech recognition unit that performs speech recognition on the audio data in accordance with the controlled speech recognition parameters.

FIG. 1 is a block diagram showing a speech recognition apparatus according to a first embodiment.
FIG. 2 is a block diagram showing a speech recognition apparatus according to a second embodiment.
FIG. 3 is a block diagram showing a speech recognition apparatus according to a third embodiment.
FIG. 4 is a block diagram showing a speech recognition apparatus according to a fourth embodiment.
FIG. 5 is a block diagram showing a speech recognition apparatus according to a fifth embodiment.
FIG. 6 is a block diagram showing a speech recognition apparatus according to a sixth embodiment.
FIG. 7 is an explanatory diagram of analysis parameters.
FIG. 8 is an explanatory diagram of control parameters.

Embodiments of the present invention will be described below with reference to the drawings.

(First Embodiment)

As shown in FIG. 1, the speech recognition apparatus according to the first embodiment includes a recognition target input unit 101, a Web page acquisition unit 102, a Web page analysis unit 103, an analysis parameter storage unit 104, an extracted text processing unit 105, a speech recognition parameter control unit 106, a content acquisition unit 107, a content analysis unit 108, a content separation unit 109, a voice input unit 110, a speech recognition unit 111, and a recognition result output unit 112.

The recognition target input unit 101 inputs, to the Web page acquisition unit 102 and the content acquisition unit 107, the identifier of a Web page that provides content including the audio data to be recognized. The identifier is expressed, for example, as a URL (Uniform Resource Locator) or a URI (Uniform Resource Identifier).

The Web page acquisition unit 102 acquires the Web page identified by the identifier received from the recognition target input unit 101, and inputs the acquired Web page to the Web page analysis unit 103.

The Web page analysis unit 103 performs analysis based on the Web page received from the Web page acquisition unit 102. Specifically, the Web page analysis unit 103 acquires analysis parameters (described later) from the analysis parameter storage unit 104 and performs the analysis in accordance with them. Through this analysis, the Web page analysis unit 103 extracts text indicating features (acoustic features, linguistic features, and so on) of the audio data to be recognized, and inputs the extracted text to the extracted text processing unit 105.

The analysis parameter storage unit 104 stores analysis parameters, for example, in the format shown in FIG. 7. In the example of FIG. 7, the analysis parameter storage unit 104 stores Web page identifiers in association with analysis parameters. Although the analysis parameters in FIG. 7 are associated with the identifiers of specific Web pages, common analysis parameters may instead be associated with the identifiers of all Web pages, or of all Web pages except specific ones.

In the example of FIG. 7, the analysis parameters include a narrowing condition and an application target of the narrowing condition. These parameters need not be specified for some or all of the Web page identifiers. The narrowing condition is a condition for deciding whether to perform analysis based on the input Web page. The application target of the narrowing condition is the range of the input Web page that is examined to decide whether the narrowing condition is satisfied. For example, if the identifier of the Web page is “http://xxxx.ne.jp”, the Web page analysis unit 103 determines whether the character string “news” is included in the range indicated by the parentheses of “<title>(.+)</title>” in the source code of the Web page (701). Here, (.+) is a regular expression, as used in UNIX (registered trademark) and various programming languages, that represents one or more occurrences of any character (that is, an arbitrary character string). The Web page analysis unit 103 performs analysis based on the Web page if the narrowing condition is satisfied and skips it otherwise. In the example of FIG. 7, a second narrowing condition is also associated with the identifier “http://xxxx.ne.jp”. The Web page analysis unit 103 therefore also determines whether the character string “sports” or the character string “variety” is included in the range indicated by the parentheses of “<genre>(.+)</genre>” in the source code of the Web page (702). Likewise, if the identifier of the Web page is “http://yyyy.ne.jp”, the Web page analysis unit 103 determines whether the character string “○○○○” is included in the range indicated by the parentheses of “<title>(.+)</title>” in the source code (703), and whether the character string “□□□□” is included in the range indicated by the parentheses of “(.+)の番組です” (roughly, “This is a program of (.+)”) in the source code (704). In the example of FIG. 7, the application target of the narrowing condition is defined either by the data structure of the Web page, such as the position of an HTML element or a tag, or by a specific character string, but it is not limited to these.
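The narrowing check described above can be sketched roughly as follows. This is only an illustration, not the implementation in the publication: the `NarrowingRule` class, the `ANALYSIS_PARAMETERS` table, and the `matching_rules` function are hypothetical names, and the table merely imitates entries 701 and 702 of FIG. 7 using the English keywords from the translation.

```python
import re
from dataclasses import dataclass


@dataclass
class NarrowingRule:
    # Regex whose first group delimits the range to inspect (the application target).
    target_pattern: str
    # The narrowing condition holds if any of these strings appears in that range.
    keywords: tuple


# Hypothetical analysis-parameter table keyed by Web page identifier.
ANALYSIS_PARAMETERS = {
    "http://xxxx.ne.jp": [
        NarrowingRule(r"<title>(.+)</title>", ("news",)),
        NarrowingRule(r"<genre>(.+)</genre>", ("sports", "variety")),
    ],
}


def matching_rules(page_id, source):
    """Return the rules registered for page_id whose narrowing condition holds."""
    hits = []
    for rule in ANALYSIS_PARAMETERS.get(page_id, []):
        match = re.search(rule.target_pattern, source)
        if match and any(k in match.group(1) for k in rule.keywords):
            hits.append(rule)
    return hits
```

A page with no registered rules, or one whose target ranges contain none of the keywords, yields an empty list, which corresponds to skipping the analysis.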

If the narrowing condition is satisfied, the Web page analysis unit 103 performs analysis based on the input Web page. Specifically, the Web page analysis unit 103 extracts text from the text analysis target associated with the Web page identifier, using the extraction method associated with that identifier. The text analysis target may be the source code of the input Web page (701, 703), the source code of another Web page (702, 704), or the like. Analyzing the source code of another Web page rather than the input one is useful, for example, when the input Web page quotes content provided on another Web page, because more detailed information can then be expected. Various text extraction methods are possible, such as full-text extraction (701), extraction of a portion containing a specific character string (703), and extraction of a specific portion defined by the data structure of the Web page (702, 704). The Web page analysis unit 103 may extract not just one text but several.
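The three kinds of extraction method mentioned above might look like the following minimal sketch. The function names are hypothetical, and the regex-based tag handling is a simplification chosen for brevity; a real implementation would more likely use an HTML parser.

```python
import re


def extract_full_text(source):
    """Full-text extraction: drop tags and collapse whitespace."""
    return " ".join(re.sub(r"<[^>]+>", " ", source).split())


def extract_containing(source, needle):
    """Extract the lines of the source that contain a specific character string."""
    return [line.strip() for line in source.splitlines() if needle in line]


def extract_by_structure(source, tag):
    """Extract the contents of every <tag>...</tag> element in the source."""
    name = re.escape(tag)
    return re.findall(rf"<{name}>(.*?)</{name}>", source, flags=re.DOTALL)
```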

The extracted text processing unit 105 converts the extracted text from the Web page analysis unit 103 into control parameters. The control parameters are used by the speech recognition parameter control unit 106 (described later) to control the speech recognition parameters. The speech recognition parameters include, for example, an acoustic model, a word dictionary, or a language model. The acoustic model represents acoustic features such as the frequency patterns of phonemes or syllables. The word dictionary lists information on recognizable words (notation, part of speech, and so on). The language model represents linguistic features such as the connection relationships between words. The speech recognition unit 111 (described later) performs speech recognition based on the acoustic features of the audio data and the linguistic features of the recognition candidates, and generates a recognition result.

Specifically, the extracted text processing unit 105 processes the extracted text in accordance with the extracted text processing method specified by the Web page analysis unit 103. Typically, the extracted text processing unit 105 performs morphological analysis on the extracted text, converts the morphological analysis result into control parameters, and inputs them to the speech recognition parameter control unit 106. For example, the extracted text processing unit 105 may detect predetermined keywords in the morphological analysis result and convert them directly into control parameters such as speaker, genre, language, dialect, or situation; it may convert them into control parameters using a Web service that handles person names, place names, and the like; or it may convert them using an ontology dictionary. If the extracted text processing unit 105 can obtain the temporal correspondence between the playback time of the content and a control parameter, it may also specify the time range over which that control parameter applies. For example, a Web page providing content sometimes carries comments associated with playback times of the content. The extracted text processing unit 105 may also convert some or all of the keywords into dictionary parameters. The speech recognition parameter control unit 106 registers the word information (notation, part of speech, and so on) corresponding to a dictionary parameter in the word dictionary, or raises the recognition priority of that word. Furthermore, the extracted text processing unit 105 may convert the extracted text into language model parameters, which can be used, for example, to update (train) the language model. The dictionary parameters and the language model parameters are part of the control parameters.
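As a rough sketch of the direct keyword-to-control-parameter conversion, consider the following. Several things here are assumptions made for illustration only: whitespace tokenization stands in for morphological analysis, the `KEYWORD_TABLE` contents are invented, and treating every unmapped token as a dictionary-parameter candidate is just one possible policy.

```python
# Hypothetical keyword table: token -> (control parameter attribute, value).
KEYWORD_TABLE = {
    "news": ("genre", "news"),
    "drama": ("genre", "drama"),
    "kansai": ("dialect", "Kansai dialect"),
    "train": ("situation", "inside a train"),
}


def to_control_parameters(extracted_text):
    """Split text into tokens and map known keywords to control parameters.

    Returns (params, dictionary_words); the first value found for each
    attribute wins, and unmapped tokens become dictionary candidates.
    """
    params = {}
    dictionary_words = []
    for token in extracted_text.lower().split():  # stand-in for morphological analysis
        if token in KEYWORD_TABLE:
            attribute, value = KEYWORD_TABLE[token]
            params.setdefault(attribute, value)
        else:
            dictionary_words.append(token)
    return params, dictionary_words
```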

FIG. 8 illustrates the control parameters. The control parameters include various attributes such as speaker, genre, language, dialect, and situation. “Control parameter: speaker” may indicate the age group, the gender, or both (for example, “male in his 20s” or “female in her 20s”), or it may indicate a specific individual (an announcer, an actor, an actress, and so on). “Control parameter: speaker” is useful for selecting an acoustic model, a language model, a word dictionary, and the like. For example, if “control parameter: speaker” is “female”, the speech recognition parameter control unit 106 can select an acoustic model or the like intended for female speakers. If “control parameter: speaker” indicates a specific individual, the speech recognition parameter control unit 106 can select an acoustic model, a language model, a word dictionary, and the like optimized for that individual.

“Control parameter: genre” indicates the genre of the content, such as “news”, “drama”, or “variety”. It is useful for selecting an acoustic model, a language model, a word dictionary, and the like. For example, if “control parameter: genre” is “news”, the speech recognition parameter control unit 106 can select an acoustic model, a language model, and a word dictionary intended for news.

“Control parameter: language” indicates the language used by the speaker, such as “Japanese”, “English”, or “Chinese”. It is useful for selecting an acoustic model, a language model, a word dictionary, and the like. “Control parameter: dialect” indicates a dialect, such as “standard Japanese”, “Kansai dialect”, or “Kyushu dialect”, that corresponds to a subset of the “control parameter: language” described above. Because knowing “control parameter: dialect” normally also determines “control parameter: language”, the extracted text processing unit 105 may automatically set the corresponding “control parameter: language” once “control parameter: dialect” is known.
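Deriving the language from a known dialect can be sketched as a simple lookup. The `DIALECT_TO_LANGUAGE` table and the `fill_language` function are hypothetical, and the choice never to overwrite an already known language is an assumption of this sketch.

```python
# Hypothetical dialect -> language table covering the dialect examples in the text.
DIALECT_TO_LANGUAGE = {
    "standard Japanese": "Japanese",
    "Kansai dialect": "Japanese",
    "Kyushu dialect": "Japanese",
}


def fill_language(params):
    """Return a copy of params with 'language' derived from 'dialect' if missing."""
    result = dict(params)
    dialect = result.get("dialect")
    if "language" not in result and dialect in DIALECT_TO_LANGUAGE:
        result["language"] = DIALECT_TO_LANGUAGE[dialect]
    return result
```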

“Control parameter: situation” indicates the environment in which the audio data was recorded, such as “inside a train”, “quiet place”, or “inside a car”. It is useful for selecting an acoustic model, controlling noise cancellation, and the like.

As also illustrated in FIG. 7, not just one but several sets of text analysis target, text extraction method, and extracted text processing method may be provided for a specific Web page identifier (701). For example, the extracted text processing unit 105 may return the leading keyword to the Web page analysis unit 103, and the Web page analysis unit 103 may generate a search query containing that keyword. The Web page analysis unit 103 may send the query to a predetermined search engine and extract text based on the one or more Web pages found. When text is extracted from several retrieved Web pages, an individual priority may be assigned to each extracted text. The priority may be determined by the identifier of each Web page or by the position of each Web page in the sorted search results. The priority can be used to select which control parameters to apply, for example when there are too many control parameters. As illustrated in FIG. 7, the text extraction method may additionally be switched according to the identifier of the retrieved Web page. A Web page retrieved on the basis of the leading keyword or the like may also be treated as a Web page newly input to the Web page analysis unit 103.
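One way to use such priorities when there are too many candidate control parameters is sketched below. It assumes that the priority is simply the rank of the source page in the search results (0 is best) and that at most `limit` distinct attributes are kept; both the function name and this policy are illustrative assumptions.

```python
def select_control_parameters(candidates, limit):
    """Keep the best-ranked value per attribute, for at most `limit` attributes.

    candidates: iterable of (rank, attribute, value), where a lower rank
    means a higher priority (for example, the search result position).
    """
    if limit <= 0:
        return {}
    chosen = {}
    for _rank, attribute, value in sorted(candidates, key=lambda c: c[0]):
        if attribute not in chosen:
            chosen[attribute] = value
        if len(chosen) == limit:
            break
    return chosen
```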

The speech recognition parameter control unit 106 controls the speech recognition parameters for the audio data to be recognized in accordance with the control parameters from the extracted text processing unit 105. For example, the speech recognition parameter control unit 106 can make a coarse selection of the acoustic model, language model, and word dictionary according to “control parameter: language” or “control parameter: dialect”, and then a finer selection according to “control parameter: speaker” or “control parameter: genre”. The speech recognition parameter control unit 106 may also refine the selection of the acoustic model or control noise cancellation according to “control parameter: situation”. It may register the word indicated by a dictionary parameter in the word dictionary or raise the recognition priority of that word, and it may update the language model according to the language model parameters. The update of the language model or the word dictionary may be temporary or persistent; that is, after the corresponding speech recognition process ends, the speech recognition parameter control unit 106 may either discard the update or keep it in effect. Whether an update is temporary or persistent may be determined in advance, specified by a control parameter, or determined individually for each language model or word dictionary.
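The coarse-to-fine selection and the temporary dictionary update can be sketched as follows. The `MODELS` registry, the model names, and both function names are hypothetical, and a plain Python set stands in for the word dictionary.

```python
from contextlib import contextmanager

# Hypothetical registry: (language, genre) -> model name; genre None is the coarse default.
MODELS = {
    ("Japanese", None): "ja-general",
    ("Japanese", "news"): "ja-news",
    ("English", None): "en-general",
}


def select_model(params):
    """Prefer a (language, genre) specific model, else fall back to the language default."""
    language = params.get("language")
    genre = params.get("genre")
    return MODELS.get((language, genre)) or MODELS.get((language, None))


@contextmanager
def temporary_words(dictionary, words):
    """Register words in the dictionary (a set) only for the duration of the block."""
    added = [w for w in words if w not in dictionary]
    dictionary.update(added)
    try:
        yield dictionary
    finally:
        # Undo only what was added here, so pre-existing entries survive.
        dictionary.difference_update(added)
```

A persistent update would simply call `dictionary.update(words)` without the context manager.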

The content acquisition unit 107 acquires the content corresponding to the Web page identifier received from the recognition target input unit 101, and inputs the acquired content to the content analysis unit 108.

The content analysis unit 108 analyzes the content from the content acquisition unit 107. The content analysis unit 108 extracts metadata and media data from the content and inputs the content to the content separation unit 109.

The content separation unit 109 separates the audio data included in the media data from the content analysis unit 108, and inputs the separated audio data to the voice input unit 110.

The voice input unit 110 converts the audio data from the content separation unit 109 into a format suitable for the speech recognition unit 111, and inputs the converted audio data to the speech recognition unit 111.

The speech recognition unit 111 waits until the speech recognition parameter control unit 106 has completed its processing for the audio data to be recognized, and then performs speech recognition on the audio data from the voice input unit 110 in accordance with the controlled speech recognition parameters. The speech recognition unit 111 inputs the recognition result to the recognition result output unit 112.

The recognition result output unit 112 outputs the recognition result. For example, the recognition result output unit 112 may display the recognition result as subtitles on a display unit (not shown) in synchronization with playback of the content, may save the recognition result as metadata of the content on a storage medium (not shown), or may use the recognition result for scene detection in the content.

As described above, the speech recognition apparatus according to the first embodiment controls the speech recognition parameters based on the Web page that provides the content. The speech recognition apparatus according to this embodiment can therefore automatically control the speech recognition parameters for the content.

(Second Embodiment)

As shown in FIG. 2, the speech recognition apparatus according to the second embodiment corresponds to the speech recognition apparatus of FIG. 1 with the speech recognition parameter control unit 106 replaced by a speech recognition parameter control unit 206, the content separation unit 109 replaced by a content separation unit 209, and a video input unit 213 and an image recognition unit 214 added. In the following description, parts of FIG. 2 that are the same as in FIG. 1 are denoted by the same reference numerals, and the description focuses on the differences.

The content separation unit 209 separates the audio data and the video data included in the media data from the content analysis unit 108. The content separation unit 209 inputs the separated audio data to the voice input unit 110 and the separated video data to the video input unit 213.

The video input unit 213 converts the video data from the content separation unit 209 into a format suitable for the image recognition unit 214, and inputs the converted video data to the image recognition unit 214. To skip image recognition for some frames of the video data, the video input unit 213 may thin out the frames in the video data from the content separation unit 209.
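Frame thinning can be as simple as keeping every n-th frame, as in the sketch below. The function name and the fixed-interval policy are assumptions; the publication does not say how frames are chosen.

```python
from itertools import islice


def thin_frames(frames, keep_every):
    """Keep one frame out of every `keep_every`, starting with the first."""
    if keep_every < 1:
        raise ValueError("keep_every must be at least 1")
    return list(islice(frames, 0, None, keep_every))
```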

The image recognition unit 214 performs image recognition on the video data from the video input unit 213. The image recognition unit 214 generates the control parameters described above based on the recognition result and inputs them to the speech recognition parameter control unit 206. Specifically, the image recognition unit 214 may recognize text displayed in the video (for example, on-screen captions or the names of program performers) and perform morphological analysis on that text. The image recognition unit 214 converts the morphological analysis result into control parameters and inputs them to the speech recognition parameter control unit 206. For example, the image recognition unit 214 may detect predetermined keywords in the morphological analysis result and convert them directly into control parameters such as speaker, genre, language, dialect, or situation; it may convert them into control parameters using a Web service that handles person names, place names, and the like; or it may convert them using an ontology dictionary. If the image recognition unit 214 can obtain the temporal correspondence between the playback time of the content and a control parameter, it may also specify the time range over which that control parameter applies. For example, the image recognition unit 214 can obtain the playback time at which text is displayed from metadata or the like. The image recognition unit 214 may also convert some or all of the keywords into dictionary parameters. Furthermore, the image recognition unit 214 may convert the recognition result into language model parameters, which can be used, for example, to update (train) the language model.

画像認識部２１４は、文字のサイズ、形状（フォント）、画面内位置、表示間隔などに応じて各単語の制御パラメータへの変換方法を切り替えたり、優先度を割り当てたりしてもよい。優先度は、制御パラメータの数が過剰である場合などに、有効とする制御パラメータの選定するために利用できる。例えば、文字のサイズが大きいほど高い優先度を割り当てたり、文字の形状が太字などの強調表示に相当するものであれば高い優先度を割り当てたり、特定の画面内位置（例えば、番組出演者の名前が表示されやすい画面下部など）に高い優先度を割り当てたりしてもよい。また、画像認識部２１４は、文字に限らず特定の放送局、番組、人物、企業、団体、商品、サービスなどを表す特定のマーク（ロゴ）を認識し、制御パラメータに変換してもよい。例えば、画像認識部２１４が、特定の番組を示すマークを認識すれば、その番組に対応する「制御パラメータ：ジャンル」、「制御パラメータ：言語」などに変換してもよい。 The image recognition unit 214 may switch the method of converting each word into a control parameter, or assign a priority, according to the character size, shape (font), on-screen position, display interval, and the like. The priority can be used to select which control parameters to enable, for example when the number of control parameters is excessive. For example, a higher priority may be assigned to larger characters, to characters whose shape corresponds to emphasis such as bold, or to a specific on-screen position (for example, the lower part of the screen, where the names of program performers tend to be displayed). Further, the image recognition unit 214 is not limited to characters; it may recognize a specific mark (logo) representing a specific broadcasting station, program, person, company, organization, product, service, or the like, and convert the mark into a control parameter. For example, if the image recognition unit 214 recognizes a mark indicating a specific program, it may convert the mark into “control parameter: genre”, “control parameter: language”, and so on corresponding to that program.
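The priority assignment described above might look like the following minimal sketch. The `RecognizedText` fields, the numeric weights, and the 0.8 threshold for the lower part of the screen are all assumptions chosen for illustration; the description names only the cues (size, emphasis, on-screen position), not how they are combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RecognizedText:
    text: str
    height_px: int   # character height in pixels (hypothetical field)
    bold: bool       # True if the glyph shape looks emphasized
    y_ratio: float   # vertical position: 0.0 = top of screen, 1.0 = bottom


def priority(item: RecognizedText) -> float:
    """Score one recognized text; the weights below are illustrative assumptions."""
    score = item.height_px / 10.0   # larger characters score higher
    if item.bold:
        score += 2.0                # emphasized shape
    if item.y_ratio >= 0.8:
        score += 3.0                # lower part of the screen
    return score


def select_top(items: List[RecognizedText], limit: int) -> List[RecognizedText]:
    """Keep at most `limit` items, highest priority first."""
    return sorted(items, key=priority, reverse=True)[:limit]
```

Capping the list with `select_top` corresponds to the case where the number of candidate control parameters is excessive.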

音声認識パラメータ制御部２０６は、抽出テキスト処理部１０５及び画像認識部２１４からの制御パラメータに従って認識対象の音声データのための音声認識パラメータを制御する。例えば、音声認識パラメータ制御部２０６は、「制御パラメータ：言語」または「制御パラメータ：方言」に従って音響モデル、言語モデル及び単語辞書を粗く選択し、「制御パラメータ：発言者」または「制御パラメータ：ジャンル」に従って音響モデル、言語モデル及び単語辞書をより細かく選択することができる。また、音声認識パラメータ制御部２０６は、「制御パラメータ：シチュエーション」に従って音響モデルをより細かく選択したり、ノイズキャンセリング処理の制御などを行ったりしてもよい。音声認識パラメータ制御部２０６は、辞書パラメータが示す単語を単語辞書に登録したり、この単語の認識優先度を高く設定したりしてもよい。音声認識パラメータ制御部２０６は、言語モデルパラメータに従って言語モデルを更新してもよい。音声認識パラメータ制御部２０６は、言語モデルまたは単語辞書を一時的に更新してもよいし、継続的に更新してもよい。 The speech recognition parameter control unit 206 controls speech recognition parameters for speech data to be recognized according to the control parameters from the extracted text processing unit 105 and the image recognition unit 214. For example, the speech recognition parameter control unit 206 roughly selects an acoustic model, a language model, and a word dictionary in accordance with “control parameter: language” or “control parameter: dialect”, and “control parameter: speaker” or “control parameter: genre”. The acoustic model, language model, and word dictionary can be selected in more detail. Further, the voice recognition parameter control unit 206 may select an acoustic model in more detail according to “control parameter: situation”, or may control noise canceling processing. The speech recognition parameter control unit 206 may register the word indicated by the dictionary parameter in the word dictionary, or set the recognition priority of this word high. The speech recognition parameter control unit 206 may update the language model according to the language model parameter. The voice recognition parameter control unit 206 may update the language model or the word dictionary temporarily or continuously.
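The coarse-to-fine selection described above can be sketched as successive filtering. The dictionary layout of a model set, the key names, and the rule of ignoring a filter that would leave no candidates are assumptions for illustration and are not taken from the description.

```python
from typing import Dict, List, Optional

ModelSet = Dict[str, str]  # hypothetical: {"name": ..., "language": ..., "genre": ...}


def select_model_set(model_sets: List[ModelSet],
                     params: Dict[str, str]) -> Optional[ModelSet]:
    """Narrow the candidates coarsely (language, dialect), then finely (genre, speaker).

    A control parameter that matches no remaining candidate is skipped, so the
    coarser selection is kept (an assumed fallback policy).
    """
    candidates = list(model_sets)
    for key in ("language", "dialect", "genre", "speaker"):
        wanted = params.get(key)
        if wanted is None:
            continue
        narrowed = [m for m in candidates if m.get(key) == wanted]
        if narrowed:
            candidates = narrowed
    return candidates[0] if candidates else None
```

Here a "model set" stands for an acoustic model, a language model, and a word dictionary that are chosen together.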

音声認識パラメータ制御部２０６は、入力される制御パラメータの一部を音声認識パラメータの制御に使用しなくてもよい。例えば、音声認識パラメータ制御部２０６は、抽出テキスト処理部１０５及び画像認識部２１４のいずれか一方からの制御パラメータを優先的に使用してもよいし、抽出テキスト処理部１０５及び画像認識部２１４を区別せずに（例えば各制御パラメータに割り当てられた優先度に従って）制御パラメータを選定してもよい。また、音声認識パラメータ制御部２０６は、抽出テキスト処理部１０５及び画像認識部２１４の両方から同一の制御パラメータが入力される場合に、この制御パラメータを優先的に使用してもよい。 The voice recognition parameter control unit 206 may not use some of the input control parameters for controlling the voice recognition parameters. For example, the speech recognition parameter control unit 206 may preferentially use a control parameter from either the extracted text processing unit 105 or the image recognition unit 214, or may use the extracted text processing unit 105 and the image recognition unit 214. Control parameters may be selected without distinction (for example, according to the priority assigned to each control parameter). Further, when the same control parameter is input from both the extracted text processing unit 105 and the image recognition unit 214, the voice recognition parameter control unit 206 may use this control parameter preferentially.
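One way to picture this selection is the sketch below, which ranks control parameters first by how many sources supplied them and then by priority. The tuple layout `(kind, value, priority)`, the source names, and the ranking rule are assumptions for illustration; the description leaves the selection policy open.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Param = Tuple[str, str, float]  # (kind, value, priority), a hypothetical layout


def select_control_params(by_source: Dict[str, List[Param]],
                          limit: int) -> List[Tuple[str, str]]:
    """Rank (kind, value) pairs: corroborated by more sources first, then by priority."""
    sources_seen = defaultdict(set)
    best_priority: Dict[Tuple[str, str], float] = {}
    for source, params in by_source.items():
        for kind, value, prio in params:
            key = (kind, value)
            sources_seen[key].add(source)
            best_priority[key] = max(prio, best_priority.get(key, prio))
    ranked = sorted(
        best_priority,
        key=lambda k: (len(sources_seen[k]), best_priority[k]),
        reverse=True,
    )
    return ranked[:limit]
```

Preferring one source outright, the other option mentioned above, would amount to filtering `by_source` before calling the function.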

以上説明したように、第２の実施形態に係る音声認識装置は、コンテンツを提供するＷｅｂページ及びコンテンツに含まれる映像データの画像認識結果に基づいて音声認識パラメータを制御する。従って、本実施形態に係る音声認識装置によれば、コンテンツのための音声認識パラメータを自動制御できる。 As described above, the speech recognition apparatus according to the second embodiment controls the speech recognition parameters based on the Web page that provides the content and on the image recognition result for the video data included in the content. Therefore, the speech recognition apparatus according to the present embodiment can automatically control speech recognition parameters for content.

（第３の実施形態） (Third embodiment)
図３に示すように、第３の実施形態に係る音声認識装置は、図１の音声認識装置において音声認識パラメータ制御部１０６を音声認識パラメータ制御部３０６に、コンテンツ分離部１０９をコンテンツ分離部３０９に夫々置換し、分離テキスト入力部３１５及び分離テキスト処理部３１６を追加した構成に相当する。以下の説明では、図３において図１と同一部分には同一符号を付して示し、異なる部分を中心に述べる。 As shown in FIG. 3, the speech recognition apparatus according to the third embodiment corresponds to a configuration in which, in the speech recognition apparatus of FIG. 1, the speech recognition parameter control unit 106 is replaced with a speech recognition parameter control unit 306, the content separation unit 109 is replaced with a content separation unit 309, and a separated text input unit 315 and a separated text processing unit 316 are added. In the following description, parts in FIG. 3 that are the same as in FIG. 1 are denoted by the same reference numerals, and the description focuses on the differing parts.

コンテンツ分離部３０９は、コンテンツ解析部１０８からのメディアデータに含まれる音声データ及びテキストデータを分離する。また、コンテンツ分離部３０９は、メタデータに含まれるテキストデータを分離してもよい。コンテンツ分離部３０９は、分離した音声データを音声入力部１１０に入力する。コンテンツ分離部３０９は、分離したテキストデータを分離テキスト入力部３１５に入力する。 The content separation unit 309 separates the audio data and text data included in the media data from the content analysis unit 108. The content separation unit 309 may also separate text data included in the metadata. The content separation unit 309 inputs the separated audio data to the audio input unit 110 and inputs the separated text data to the separated text input unit 315.

分離テキスト入力部３１５は、コンテンツ分離部３０９からの分離テキストデータを分離テキスト処理部３１６に適した形式に変換する。分離テキスト入力部３１５は、変換済みの分離テキストデータを分離テキスト処理部３１６に入力する。 The separated text input unit 315 converts the separated text data from the content separating unit 309 into a format suitable for the separated text processing unit 316. The separated text input unit 315 inputs the converted separated text data to the separated text processing unit 316.

分離テキスト処理部３１６は、分離テキスト入力部３１５からの分離テキストに基づいて制御パラメータを生成し、音声認識パラメータ制御部３０６に入力する。具体的には、分離テキスト処理部３１６は、分離テキストに形態素解析を行ってよい。分離テキスト処理部３１６は、これら形態素解析結果を制御パラメータに変換し、音声認識パラメータ制御部３０６に入力する。例えば、分離テキスト処理部３１６は、形態素解析結果から所定のキーワードを検出し、発言者、ジャンル、言語、方言またはシチュエーションなどの制御パラメータとして直接変換してもよいし、人名、地名などを扱うＷｅｂサービスを利用して制御パラメータに変換してもよいし、オントロジー辞書を利用して制御パラメータに変換してもよい。また、分離テキスト処理部３１６は、コンテンツの再生時間と制御パラメータとの時間的な対応関係を取得できるならば、制御パラメータを適用する時間的な範囲を指定してもよい。また、分離テキスト処理部３１６は、キーワードの一部または全部を辞書パラメータに変換してもよい。更に、分離テキスト処理部３１６は、認識結果を言語モデルパラメータに変換してもよい。言語モデルパラメータは、言語モデルの更新（学習）などに利用できる。 The separated text processing unit 316 generates control parameters based on the separated text from the separated text input unit 315 and inputs them to the speech recognition parameter control unit 306. Specifically, the separated text processing unit 316 may perform morphological analysis on the separated text. The separated text processing unit 316 converts the morphological analysis results into control parameters and inputs them to the speech recognition parameter control unit 306. For example, the separated text processing unit 316 may detect a predetermined keyword in the morphological analysis result and convert it directly into a control parameter such as speaker, genre, language, dialect, or situation; it may convert the result into control parameters using a Web service that handles person names, place names, and the like; or it may convert the result into control parameters using an ontology dictionary. In addition, if the temporal correspondence between the content playback time and a control parameter can be acquired, the separated text processing unit 316 may designate the temporal range over which the control parameter is applied. The separated text processing unit 316 may also convert some or all of the keywords into dictionary parameters. Furthermore, the separated text processing unit 316 may convert the recognition result into language model parameters. The language model parameters can be used for updating (training) the language model.

音声認識パラメータ制御部３０６は、抽出テキスト処理部１０５及び分離テキスト処理部３１６からの制御パラメータに従って認識対象の音声データのための音声認識パラメータを制御する。例えば、音声認識パラメータ制御部３０６は、「制御パラメータ：言語」または「制御パラメータ：方言」に従って音響モデル、言語モデル及び単語辞書を粗く選択し、「制御パラメータ：発言者」または「制御パラメータ：ジャンル」に従って音響モデル、言語モデル及び単語辞書をより細かく選択することができる。また、音声認識パラメータ制御部３０６は、「制御パラメータ：シチュエーション」に従って音響モデルをより細かく選択したり、ノイズキャンセリング処理の制御などを行ったりしてもよい。音声認識パラメータ制御部３０６は、辞書パラメータが示す単語を単語辞書に登録したり、この単語の認識優先度を高く設定したりしてもよい。音声認識パラメータ制御部３０６は、言語モデルパラメータに従って言語モデルを更新してもよい。音声認識パラメータ制御部３０６は、言語モデルまたは単語辞書を一時的に更新してもよいし、継続的に更新してもよい。 The speech recognition parameter control unit 306 controls speech recognition parameters for speech data to be recognized according to control parameters from the extracted text processing unit 105 and the separated text processing unit 316. For example, the speech recognition parameter control unit 306 roughly selects an acoustic model, a language model, and a word dictionary according to “control parameter: language” or “control parameter: dialect”, and “control parameter: speaker” or “control parameter: genre”. The acoustic model, language model, and word dictionary can be selected in more detail. Further, the voice recognition parameter control unit 306 may select an acoustic model in more detail according to “control parameter: situation”, or may control noise canceling processing. The voice recognition parameter control unit 306 may register the word indicated by the dictionary parameter in the word dictionary or set the recognition priority of this word to be high. The speech recognition parameter control unit 306 may update the language model according to the language model parameter. The speech recognition parameter control unit 306 may update the language model or the word dictionary temporarily or continuously.

音声認識パラメータ制御部３０６は、入力される制御パラメータの一部を音声認識パラメータの制御に使用しなくてもよい。例えば、音声認識パラメータ制御部３０６は、抽出テキスト処理部１０５及び分離テキスト処理部３１６のいずれか一方からの制御パラメータを優先的に使用してもよいし、抽出テキスト処理部１０５及び分離テキスト処理部３１６を区別せずに（例えば各制御パラメータに割り当てられた優先度に従って）制御パラメータを選定してもよい。また、音声認識パラメータ制御部３０６は、抽出テキスト処理部１０５及び分離テキスト処理部３１６の両方から同一の制御パラメータが入力される場合に、この制御パラメータを優先的に使用してもよい。 The voice recognition parameter control unit 306 may not use a part of the input control parameters for controlling the voice recognition parameters. For example, the speech recognition parameter control unit 306 may preferentially use the control parameter from either the extracted text processing unit 105 or the separated text processing unit 316, or the extracted text processing unit 105 and the separated text processing unit. Control parameters may be selected without distinguishing 316 (eg, according to the priority assigned to each control parameter). In addition, when the same control parameter is input from both the extracted text processing unit 105 and the separated text processing unit 316, the voice recognition parameter control unit 306 may preferentially use the control parameter.

以上説明したように、第３の実施形態に係る音声認識装置は、コンテンツを提供するＷｅｂページ及びコンテンツに含まれるテキストデータに基づいて音声認識パラメータを制御する。従って、本実施形態に係る音声認識装置によれば、コンテンツのための音声認識パラメータを自動制御できる。 As described above, the speech recognition apparatus according to the third embodiment controls the speech recognition parameters based on the Web page that provides the content and the text data included in the content. Therefore, the speech recognition apparatus according to the present embodiment can automatically control speech recognition parameters for content.

（第４の実施形態） (Fourth embodiment)
図４に示すように、第４の実施形態に係る音声認識装置は、図１の音声認識装置において音声認識パラメータ制御部１０６を音声認識パラメータ制御部４０６に、コンテンツ分離部１０９をコンテンツ分離部４０９に夫々置換し、図２の映像入力部２１３及び画像認識部２１４と図３の分離テキスト入力部３１５及び分離テキスト処理部３１６とを追加した構成に相当する。以下の説明では、図４において図１、図２または図３と同一部分には同一符号を付して示し、異なる部分を中心に述べる。 As shown in FIG. 4, the speech recognition apparatus according to the fourth embodiment corresponds to a configuration in which, in the speech recognition apparatus of FIG. 1, the speech recognition parameter control unit 106 is replaced with a speech recognition parameter control unit 406, the content separation unit 109 is replaced with a content separation unit 409, and the video input unit 213 and image recognition unit 214 of FIG. 2 and the separated text input unit 315 and separated text processing unit 316 of FIG. 3 are added. In the following description, parts in FIG. 4 that are the same as in FIG. 1, FIG. 2, or FIG. 3 are denoted by the same reference numerals, and the description focuses on the differing parts.

コンテンツ分離部４０９は、コンテンツ解析部１０８からのメディアデータに含まれる音声データ、映像データ及びテキストデータを分離する。また、コンテンツ分離部４０９は、メタデータに含まれるテキストデータを分離してもよい。コンテンツ分離部４０９は、分離した音声データを音声入力部１１０に入力する。コンテンツ分離部４０９は、分離した映像データを映像入力部２１３に入力する。コンテンツ分離部４０９は、分離したテキストデータを分離テキスト入力部３１５に入力する。 The content separation unit 409 separates the audio data, video data, and text data included in the media data from the content analysis unit 108. The content separation unit 409 may also separate text data included in the metadata. The content separation unit 409 inputs the separated audio data to the audio input unit 110, the separated video data to the video input unit 213, and the separated text data to the separated text input unit 315.

音声認識パラメータ制御部４０６は、抽出テキスト処理部１０５、画像認識部２１４及び分離テキスト処理部３１６からの制御パラメータに従って認識対象の音声データのための音声認識パラメータを制御する。例えば、音声認識パラメータ制御部４０６は、「制御パラメータ：言語」または「制御パラメータ：方言」に従って音響モデル、言語モデル及び単語辞書を粗く選択し、「制御パラメータ：発言者」または「制御パラメータ：ジャンル」に従って音響モデル、言語モデル及び単語辞書をより細かく選択することができる。また、音声認識パラメータ制御部４０６は、「制御パラメータ：シチュエーション」に従って音響モデルをより細かく選択したり、ノイズキャンセリング処理の制御などを行ったりしてもよい。音声認識パラメータ制御部４０６は、辞書パラメータが示す単語を単語辞書に登録したり、この単語の認識優先度を高く設定したりしてもよい。音声認識パラメータ制御部４０６は、言語モデルパラメータに従って言語モデルを更新してもよい。音声認識パラメータ制御部４０６は、言語モデルまたは単語辞書を一時的に更新してもよいし、継続的に更新してもよい。 The speech recognition parameter control unit 406 controls speech recognition parameters for speech data to be recognized according to control parameters from the extracted text processing unit 105, the image recognition unit 214, and the separated text processing unit 316. For example, the voice recognition parameter control unit 406 roughly selects an acoustic model, a language model, and a word dictionary in accordance with “control parameter: language” or “control parameter: dialect”, and “control parameter: speaker” or “control parameter: genre”. The acoustic model, language model, and word dictionary can be selected in more detail. Further, the voice recognition parameter control unit 406 may select an acoustic model in more detail according to “control parameter: situation”, or may control noise canceling processing. The speech recognition parameter control unit 406 may register the word indicated by the dictionary parameter in the word dictionary, or set the recognition priority of this word high. The speech recognition parameter control unit 406 may update the language model according to the language model parameter. The speech recognition parameter control unit 406 may update the language model or the word dictionary temporarily or continuously.

音声認識パラメータ制御部４０６は、入力される制御パラメータの一部を音声認識パラメータの制御に使用しなくてもよい。例えば、音声認識パラメータ制御部４０６は、抽出テキスト処理部１０５、画像認識部２１４及び分離テキスト処理部３１６のうちの一部からの制御パラメータを優先的に使用してもよいし、抽出テキスト処理部１０５、画像認識部２１４及び分離テキスト処理部３１６を区別せずに（例えば各制御パラメータに割り当てられた優先度に従って）制御パラメータを選定してもよい。また、音声認識パラメータ制御部４０６は、抽出テキスト処理部１０５、画像認識部２１４及び分離テキスト処理部３１６のうち複数から同一の制御パラメータが入力される場合に、この制御パラメータを優先的に使用してもよい。 The speech recognition parameter control unit 406 need not use all of the input control parameters for controlling the speech recognition parameters. For example, the speech recognition parameter control unit 406 may preferentially use control parameters from some of the extracted text processing unit 105, the image recognition unit 214, and the separated text processing unit 316, or it may select control parameters without distinguishing among the extracted text processing unit 105, the image recognition unit 214, and the separated text processing unit 316 (for example, according to the priority assigned to each control parameter). In addition, when the same control parameter is input from more than one of the extracted text processing unit 105, the image recognition unit 214, and the separated text processing unit 316, the speech recognition parameter control unit 406 may use that control parameter preferentially.

以上説明したように、第４の実施形態に係る音声認識装置は、コンテンツを提供するＷｅｂページ、コンテンツに含まれる映像データの画像認識結果及びコンテンツに含まれるテキストデータに基づいて音声認識パラメータを制御する。従って、本実施形態に係る音声認識装置によれば、コンテンツのための音声認識パラメータを自動制御できる。 As described above, the speech recognition apparatus according to the fourth embodiment controls the speech recognition parameters based on the Web page that provides the content, the image recognition result for the video data included in the content, and the text data included in the content. Therefore, the speech recognition apparatus according to the present embodiment can automatically control speech recognition parameters for content.

（第５の実施形態） (Fifth embodiment)
図５に示すように、第５の実施形態に係る音声認識装置は、認識対象入力部５０１、コンテンツ取得部１０７、コンテンツ解析部１０８、コンテンツ分離部５０９、音声入力部５１０、第１の音声認識部５１７、映像入力部５１３、画像認識部５１４、分離テキスト入力部５１５、分離テキスト処理部５１６、Ｗｅｂページ取得部５０２、Ｗｅｂページ解析部１０３、解析パラメータ記憶部１０４、抽出テキスト処理部１０５、音声認識パラメータ制御部１０６、第２の音声認識部５１１及び認識結果出力部１１２を有する。以下の説明では、図５において図１と同一部分には同一符号を付して示し、異なる部分を中心に述べる。 As shown in FIG. 5, the speech recognition apparatus according to the fifth embodiment includes a recognition target input unit 501, a content acquisition unit 107, a content analysis unit 108, a content separation unit 509, an audio input unit 510, a first speech recognition unit 517, a video input unit 513, an image recognition unit 514, a separated text input unit 515, a separated text processing unit 516, a Web page acquisition unit 502, a Web page analysis unit 103, an analysis parameter storage unit 104, an extracted text processing unit 105, a speech recognition parameter control unit 106, a second speech recognition unit 511, and a recognition result output unit 112. In the following description, parts in FIG. 5 that are the same as in FIG. 1 are denoted by the same reference numerals, and the description focuses on the differing parts.

認識対象入力部５０１は、音声認識の対象となる音声データを含むコンテンツを取得するための情報をコンテンツ取得部１０７に入力する。この情報は、コンテンツを提供するＷｅｂページの識別子に限らず、コンテンツが読み出される記憶媒体のアドレス情報、コンテンツが放送されるチャンネルなどであってもよい。 The recognition target input unit 501 inputs information for acquiring content including audio data to be subjected to voice recognition to the content acquisition unit 107. This information is not limited to the identifier of the Web page that provides the content, but may be address information of a storage medium from which the content is read, a channel through which the content is broadcast, and the like.

コンテンツ分離部５０９は、コンテンツ解析部１０８からのメディアデータに含まれる音声データ、映像データ及びテキストデータを分離する。また、コンテンツ分離部５０９は、メタデータに含まれるテキストデータを分離してもよい。コンテンツ分離部５０９は、分離した音声データを音声入力部５１０に入力する。コンテンツ分離部５０９は、分離した映像データを映像入力部５１３に入力する。コンテンツ分離部５０９は、分離したテキストデータを分離テキスト入力部５１５に入力する。 The content separation unit 509 separates the audio data, video data, and text data included in the media data from the content analysis unit 108. The content separation unit 509 may also separate text data included in the metadata. The content separation unit 509 inputs the separated audio data to the audio input unit 510, the separated video data to the video input unit 513, and the separated text data to the separated text input unit 515.

音声入力部５１０は、コンテンツ分離部５０９からの音声データを第１の音声認識部５１７及び第２の音声認識部５１１に適した形式に変換する。音声入力部５１０は、変換済みの音声データを第１の音声認識部５１７及び第２の音声認識部５１１に入力する。第１の音声認識部５１７は、音声入力部５１０からの音声データに対して音声認識を行う。第１の音声認識部５１７は、認識結果に含まれる単語またはＷｅｂページの識別子を抽出し、Ｗｅｂページ取得部５０２に入力する。 The voice input unit 510 converts the voice data from the content separation unit 509 into a format suitable for the first voice recognition unit 517 and the second voice recognition unit 511. The voice input unit 510 inputs the converted voice data to the first voice recognition unit 517 and the second voice recognition unit 511. The first voice recognition unit 517 performs voice recognition on the voice data from the voice input unit 510. The first speech recognition unit 517 extracts a word or web page identifier included in the recognition result and inputs the extracted word or web page identifier to the web page acquisition unit 502.

映像入力部５１３は、コンテンツ分離部５０９からの映像データを画像認識部５１４に適した形式に変換する。映像入力部５１３は、変換済みの映像データを画像認識部５１４に入力する。尚、映像データ中の一部のフレームに対して画像認識を省略するために、映像入力部５１３は、コンテンツ分離部５０９からの映像データ中のフレームを間引いてもよい。 The video input unit 513 converts the video data from the content separation unit 509 into a format suitable for the image recognition unit 514. The video input unit 513 inputs the converted video data to the image recognition unit 514. Note that the video input unit 513 may thin out frames in the video data from the content separation unit 509 in order to omit image recognition for some frames in the video data.

画像認識部５１４は、映像入力部５１３からの映像データに対して画像認識を行う。画像認識部５１４は、認識結果から単語またはＷｅｂページの識別子を抽出し、Ｗｅｂページ取得部５０２に入力する。具体的には、画像認識部５１４は映像中に表示されたテキスト（例えば、テロップ、番組出演者の名前、コンテンツに関連するＷｅｂページのＵＲＬなど）を認識する。画像認識部５１４は、この認識結果に含まれる単語またはＷｅｂページの識別子をＷｅｂページ取得部５０２に入力する。 The image recognition unit 514 performs image recognition on the video data from the video input unit 513. The image recognition unit 514 extracts a word or Web page identifier from the recognition result and inputs the extracted word or Web page identifier to the Web page acquisition unit 502. Specifically, the image recognition unit 514 recognizes text (for example, a telop, the name of a program performer, a URL of a Web page related to the content, etc.) displayed in the video. The image recognition unit 514 inputs a word or Web page identifier included in the recognition result to the Web page acquisition unit 502.

また、画像認識部５１４は、文字のサイズ、形状（フォント）、画面内位置、表示間隔などに応じて、単語またはＷｅｂページの識別子に優先度を割り当ててもよい。優先度は、単語またはＷｅｂページの識別子の数が過剰である場合などに、有効とする単語またはＷｅｂページの識別子の選定するために利用できる。例えば、文字のサイズが大きいほど高い優先度を割り当てたり、文字の形状が太字などの強調表示に相当するものであれば高い優先度を割り当てたり、特定の画面内位置（例えば、番組出演者の名前が表示されやすい画面下部など）に高い優先度を割り当てたりしてもよい。或いは、画像認識部５１４は、文字に限らず特定の放送局、番組、人物、企業、団体、商品、サービスなどを表す特定のマーク（ロゴ）を認識し、対応する単語または対応するＷｅｂページの識別子に変換してもよい。 In addition, the image recognition unit 514 may assign a priority to a word or Web page identifier according to the character size, shape (font), on-screen position, display interval, and the like. The priority can be used to select which words or Web page identifiers to enable, for example when the number of words or Web page identifiers is excessive. For example, a higher priority may be assigned to larger characters, to characters whose shape corresponds to emphasis such as bold, or to a specific on-screen position (for example, the lower part of the screen, where the names of program performers tend to be displayed). Alternatively, the image recognition unit 514 is not limited to characters; it may recognize a specific mark (logo) representing a specific broadcasting station, program, person, company, organization, product, service, or the like, and convert the mark into a corresponding word or a corresponding Web page identifier.

分離テキスト入力部５１５は、コンテンツ分離部５０９からの分離テキストデータを分離テキスト処理部５１６に適した形式に変換する。分離テキスト入力部５１５は、変換済みの分離テキストデータを分離テキスト処理部５１６に入力する。 The separated text input unit 515 converts the separated text data from the content separating unit 509 into a format suitable for the separated text processing unit 516. The separated text input unit 515 inputs the converted separated text data to the separated text processing unit 516.

分離テキスト処理部５１６は、分離テキスト入力部５１５からの分離テキストから単語またはＷｅｂページの識別子を抽出し、Ｗｅｂページ取得部５０２に入力する。具体的には、分離テキスト処理部５１６は、分離テキストに含まれる単語またはＷｅｂページの識別子を抽出する。 The separated text processing unit 516 extracts a word or web page identifier from the separated text from the separated text input unit 515 and inputs the extracted word or web page identifier to the web page acquisition unit 502. Specifically, the separated text processing unit 516 extracts a word or Web page identifier included in the separated text.

Ｗｅｂページ取得部５０２は、第１の音声認識部５１７、画像認識部５１４及び分離テキスト処理部５１６からの単語またはＷｅｂページの識別子に基づいてコンテンツに関連するＷｅｂページを取得する。具体的には、Ｗｅｂページ取得部５０２は、単語が入力された場合には、この単語を使用して検索式を生成する。Ｗｅｂページ取得部５０２は、この検索式を所定の検索エンジンに送信し、検索結果からＷｅｂページを取得する。一方、Ｗｅｂページ取得部５０２は、Ｗｅｂページの識別子が入力された場合には、このＷｅｂページの識別子に従ってＷｅｂページを取得する。Ｗｅｂページ取得部５０２は、取得したＷｅｂページをＷｅｂページ解析部１０３に入力する。 The web page acquisition unit 502 acquires a web page related to the content based on the word or web page identifier from the first speech recognition unit 517, the image recognition unit 514, and the separated text processing unit 516. Specifically, when a word is input, the Web page acquisition unit 502 generates a search expression using this word. The web page acquisition unit 502 transmits this search formula to a predetermined search engine, and acquires a web page from the search result. On the other hand, when a web page identifier is input, the web page acquisition unit 502 acquires the web page according to the web page identifier. The web page acquisition unit 502 inputs the acquired web page to the web page analysis unit 103.

また、Ｗｅｂページ取得部５０２は、検索式に含める単語の数、検索結果から取得するＷｅｂページの数、Ｗｅｂページの識別子に従って取得するＷｅｂページの数などを制限してもよい。例えば、Ｗｅｂページ取得部５０２は、第１の音声認識部５１７、画像認識部５１４及び分離テキスト処理部５１６のうち一部からの単語を優先的に検索式に含めてもよいし、これらのうち一部からのＷｅｂページの識別子を優先的に選択してＷｅｂページを取得してもよい。或いは、Ｗｅｂページ取得部５０２は、第１の音声認識部５１７、画像認識部５１４及び分離テキスト処理部５１６を区別せずに（例えば各単語に割り当てられた優先度に従って）各単語を重み付けして検索式を生成してもよい。ここで、重み付けすることとは、例えば、優先度の高い順に所定個数以下の単語を組み合わせること、優先度が所定値以上の単語を組み合わせることなどを意味する。また、Ｗｅｂページ取得部５０２は、第１の音声認識部５１７、画像認識部５１４及び分離テキスト処理部５１６のうち複数から同一の単語または同一のＷｅｂページの識別子が入力される場合に、この単語またはＷｅｂページの識別子を優先的に使用してもよい。 Further, the Web page acquisition unit 502 may limit the number of words included in the search expression, the number of Web pages acquired from the search result, the number of Web pages acquired according to Web page identifiers, and the like. For example, the Web page acquisition unit 502 may preferentially include in the search expression words from some of the first speech recognition unit 517, the image recognition unit 514, and the separated text processing unit 516, or it may preferentially select Web page identifiers from some of these units when acquiring Web pages. Alternatively, the Web page acquisition unit 502 may generate the search expression by weighting each word without distinguishing among the first speech recognition unit 517, the image recognition unit 514, and the separated text processing unit 516 (for example, according to the priority assigned to each word). Here, weighting means, for example, combining up to a predetermined number of words in descending order of priority, or combining words whose priority is at or above a predetermined value. In addition, when the same word or the same Web page identifier is input from more than one of the first speech recognition unit 517, the image recognition unit 514, and the separated text processing unit 516, the Web page acquisition unit 502 may use that word or Web page identifier preferentially.
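The two weighting rules named above (a capped number of words in descending priority, or only words at or above a priority threshold) can be sketched as follows. The function name, the `{word: priority}` input, and the space-joined query string are assumptions for illustration; the description does not define the search expression syntax.

```python
from typing import Dict, Optional


def build_query(words: Dict[str, float],
                max_words: Optional[int] = None,
                min_priority: Optional[float] = None) -> str:
    """Build a space-separated search query from prioritized words.

    `max_words` keeps at most that many words in descending priority order;
    `min_priority` keeps only words whose priority is at or above the threshold.
    """
    ranked = sorted(words.items(), key=lambda kv: kv[1], reverse=True)
    if min_priority is not None:
        ranked = [(w, p) for w, p in ranked if p >= min_priority]
    if max_words is not None:
        ranked = ranked[:max_words]
    return " ".join(word for word, _ in ranked)
```

The resulting string would then be sent to whatever search engine the Web page acquisition unit is configured to use.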

第２の音声認識部５１１は、認識対象となる音声データに関して前述の音声認識パラメータ制御部１０６の処理が完了してから、音声入力部５１０からの音声データに対して音声認識を行う。第２の音声認識部５１１は、認識結果を認識結果出力部１１２に入力する。尚、第１の音声認識部５１７及び第２の音声認識部５１１は、別個のモジュールであってもよいし、一体化されたモジュールであってもよい。 The second voice recognition unit 511 performs voice recognition on the voice data from the voice input unit 510 after the processing of the voice recognition parameter control unit 106 is completed for the voice data to be recognized. The second voice recognition unit 511 inputs the recognition result to the recognition result output unit 112. The first voice recognition unit 517 and the second voice recognition unit 511 may be separate modules or integrated modules.

以上説明したように第５の実施形態に係る音声認識装置は、コンテンツから分離された音声データに対する音声認識結果、コンテンツから分離された映像データに対する画像認識結果、コンテンツから分離されたテキストなどに基づいてコンテンツに関連するＷｅｂページを取得し、この関連するＷｅｂページに基づいて音声認識パラメータを制御する。従って、本実施形態に係る音声認識装置によれば、コンテンツを提供するＷｅｂページが存在しない場合、不明な場合などにも、コンテンツのための音声認識パラメータを自動制御できる。 As described above, the speech recognition apparatus according to the fifth embodiment acquires a Web page related to the content based on the speech recognition result for the audio data separated from the content, the image recognition result for the video data separated from the content, the text separated from the content, and the like, and controls the speech recognition parameters based on this related Web page. Therefore, the speech recognition apparatus according to the present embodiment can automatically control speech recognition parameters for content even when a Web page providing the content does not exist or is unknown.

本実施形態に係る音声認識装置は、コンテンツに含まれる音声データ、映像データ及びテキストデータを利用してコンテンツに関連するＷｅｂページを検索している。しかしながら、必ずしもこれら全てを利用しなくても、本実施形態に係る音声認識装置と類似の効果を得ることができる。音声データを利用しない場合には、図５において第１の音声認識部５１７は除去されてよい。映像データを利用しない場合には、図５において映像入力部５１３及び画像認識部５１４は除去されてよい。テキストデータを利用しない場合には分離テキスト入力部５１５及び分離テキスト処理部５１６は除去されてよい。 The speech recognition apparatus according to the present embodiment searches for a web page related to content using audio data, video data, and text data included in the content. However, an effect similar to that of the speech recognition apparatus according to the present embodiment can be obtained without necessarily using all of them. When voice data is not used, the first voice recognition unit 517 in FIG. 5 may be removed. When the video data is not used, the video input unit 513 and the image recognition unit 514 in FIG. 5 may be removed. When text data is not used, the separated text input unit 515 and the separated text processing unit 516 may be removed.

（第６の実施形態） (Sixth embodiment)
図６に示すように、第６の実施形態に係る音声認識装置は、図５の認識対象入力部５０１を認識対象入力部６０１に、Ｗｅｂページ取得部１０２をＷｅｂページ取得部６０２に夫々置換した構成に相当する。以下の説明では、図６において図５と同一部分には同一符号を付して示し、異なる部分を中心に述べる。 As shown in FIG. 6, the speech recognition apparatus according to the sixth embodiment corresponds to a configuration in which the recognition target input unit 501 of FIG. 5 is replaced with a recognition target input unit 601 and the Web page acquisition unit 102 is replaced with a Web page acquisition unit 602. In the following description, parts in FIG. 6 that are the same as in FIG. 5 are denoted by the same reference numerals, and the description focuses on the differing parts.

認識対象入力部６０１は、音声認識の対象となる音声データを含むコンテンツを提供するＷｅｂページの識別子をＷｅｂページ取得部６０２及びコンテンツ取得部１０７に入力する。 The recognition target input unit 601 inputs, to the Web page acquisition unit 602 and the content acquisition unit 107, an identifier of a Web page that provides content including audio data that is a target of voice recognition.

Ｗｅｂページ取得部６０２は、認識対象入力部６０１からのＷｅｂページの識別子に従ってＷｅｂページを取得する。また、Ｗｅｂページ取得部６０２は、Ｗｅｂページ取得部５０２と同様に、第１の音声認識部５１７、画像認識部５１４及び分離テキスト処理部５１６からの単語またはＷｅｂページの識別子に基づいてコンテンツに関連するＷｅｂページを取得する。Ｗｅｂページ取得部６０２は、取得したＷｅｂページをＷｅｂページ解析部１０３に入力する。 The Web page acquisition unit 602 acquires a Web page according to the Web page identifier from the recognition target input unit 601. In addition, like the Web page acquisition unit 502, the Web page acquisition unit 602 acquires a Web page related to the content based on the words or Web page identifiers from the first speech recognition unit 517, the image recognition unit 514, and the separated text processing unit 516. The Web page acquisition unit 602 inputs the acquired Web pages to the Web page analysis unit 103.

一例として、Ｗｅｂページ取得部６０２は、最初に、認識対象入力部６０１からのＷｅｂページの識別子に従ってＷｅｂページを取得する。そして、Ｗｅｂページ取得部６０２は、このＷｅｂページに関して抽出テキスト処理部１０５が十分な制御パラメータを得られなければ、第１の音声認識部５１７、画像認識部５１４及び分離テキスト処理部５１６からの単語またはＷｅｂページの識別子に基づいてコンテンツに関連するＷｅｂページを追加的に取得してもよい。 As an example, the Web page acquisition unit 602 first acquires a Web page according to the Web page identifier from the recognition target input unit 601. If the extracted text processing unit 105 cannot obtain sufficient control parameters for the Web page, the web page acquisition unit 602 uses words from the first speech recognition unit 517, the image recognition unit 514, and the separated text processing unit 516. Or you may acquire additionally the web page relevant to a content based on the identifier of a web page.

As another example, the Web page acquisition unit 602 first acquires a Web page related to the content based on words or Web page identifiers from the first speech recognition unit 517, the image recognition unit 514, and the separated text processing unit 516. Then, if the extracted text processing unit 105 cannot obtain sufficient control parameters from this Web page, the Web page acquisition unit 602 may additionally acquire a Web page according to the Web page identifier from the recognition target input unit 601.
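The two sequential strategies above differ only in which source is consulted first, so they can be sketched with a single helper. This is a minimal illustration, not the patented implementation: the function names, the representation of a page as a plain string, and the sufficiency test are all assumptions made for the example.

```python
from typing import Callable, Iterable, List

# Assumption: a "page" is represented by its text only.
Page = str
Fetcher = Callable[[], Iterable[Page]]


def acquire_with_fallback(
    primary: Fetcher,
    secondary: Fetcher,
    is_sufficient: Callable[[List[Page]], bool],
) -> List[Page]:
    """Fetch from the primary source; consult the secondary only if needed."""
    pages = list(primary())
    if not is_sufficient(pages):
        pages.extend(secondary())
    return pages


def fetch_providing_page() -> List[Page]:
    # Hypothetical stand-in for fetching the page named by the identifier
    # supplied by the recognition target input unit 601.
    return ["video page: title only"]


def fetch_related_pages() -> List[Page]:
    # Hypothetical stand-in for a search driven by words from units 517, 514 and 516.
    return ["related article: genre news, speaker female"]


def enough_control_parameters(pages: List[Page]) -> bool:
    # Toy criterion (assumption): at least one page must mention a genre.
    return any("genre" in page for page in pages)


# Providing page first, related pages as the fallback.
first_example = acquire_with_fallback(
    fetch_providing_page, fetch_related_pages, enough_control_parameters
)
# Related pages first, providing page as the fallback.
second_example = acquire_with_fallback(
    fetch_related_pages, fetch_providing_page, enough_control_parameters
)
```

Passing the sources as callables means the fallback source is never contacted when the first one already yields enough control parameters.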

As yet another example, the Web page acquisition unit 602 may acquire, in parallel, both the Web page specified by the Web page identifier from the recognition target input unit 601 and a Web page related to the content based on words or Web page identifiers from the first speech recognition unit 517, the image recognition unit 514, and the separated text processing unit 516.
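Parallel acquisition of both kinds of Web page could look roughly like the following sketch, which uses Python's standard `concurrent.futures` thread pool. The two fetch functions are hypothetical stand-ins that return fixed strings instead of performing network access, and the identifier and words are made-up sample values.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fetch_providing_page(identifier: str) -> List[str]:
    # Hypothetical stand-in for fetching the page that provides the content.
    return [f"providing page for {identifier}"]


def fetch_related_pages(words: List[str]) -> List[str]:
    # Hypothetical stand-in for searching pages related to the content.
    return [f"related page for {word}" for word in words]


def acquire_in_parallel(identifier: str, words: List[str]) -> List[str]:
    """Start both acquisitions at once and combine their results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        providing = pool.submit(fetch_providing_page, identifier)
        related = pool.submit(fetch_related_pages, words)
        # result() blocks until the corresponding fetch has finished.
        return providing.result() + related.result()


pages = acquire_in_parallel("http://example.com/video/1", ["election", "weather"])
```

Compared with the sequential strategies, this trades extra requests for lower latency, since neither source waits on the other.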

As described above, the speech recognition apparatus according to the sixth embodiment controls the speech recognition parameters based on at least one of a Web page that provides the content and a Web page related to the content. The speech recognition apparatus according to this embodiment can therefore control the speech recognition parameters automatically.

The speech recognition apparatus according to this embodiment searches for Web pages related to the content using the audio data, video data, and text data included in the content. However, an effect similar to that of this embodiment can be obtained without necessarily using all of them. When audio data is not used, the first speech recognition unit 517 may be removed from FIG. 6. When video data is not used, the video input unit 513 and the image recognition unit 514 may be removed from FIG. 6. When text data is not used, the separated text input unit 515 and the separated text processing unit 516 may be removed.
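The optional nature of the three sources can be illustrated with a small sketch in which each source of search words may simply be absent. The function name, the parameter names, and the de-duplication step are assumptions for this example and are not taken from the embodiment.

```python
from typing import Iterable, List, Optional


def collect_search_words(
    speech_words: Optional[Iterable[str]] = None,  # from unit 517, if present
    image_words: Optional[Iterable[str]] = None,   # from unit 514, if present
    text_words: Optional[Iterable[str]] = None,    # from unit 516, if present
) -> List[str]:
    """Merge words from whichever sources exist, keeping first-seen order."""
    merged: List[str] = []
    for source in (speech_words, image_words, text_words):
        for word in source or ():
            # Dropping duplicates is an assumption made for this sketch.
            if word not in merged:
                merged.append(word)
    return merged


# All three sources available.
all_sources = collect_search_words(["election"], ["NEWS7"], ["election", "tokyo"])
# Video data not used: the image recognition source is omitted.
no_video = collect_search_words(speech_words=["election"], text_words=["tokyo"])
```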

The present invention is not limited to the embodiments exactly as described above; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can be formed by appropriately combining the plurality of constituent elements disclosed in the above embodiments. For example, a configuration in which some constituent elements are deleted from all those shown in an embodiment is also conceivable. Furthermore, constituent elements described in different embodiments may be combined as appropriate.

For example, a program that implements the processing of each of the above embodiments can be provided stored in a computer-readable storage medium. The storage medium may take any storage format, such as a magnetic disk, an optical disc (CD-ROM, CD-R, DVD, etc.), a magneto-optical disk (MO, etc.), or a semiconductor memory, as long as it can store the program and can be read by a computer.

The program that implements the processing of each of the above embodiments may also be stored on a computer (server) connected to a network such as the Internet and downloaded to a computer (client) via the network.

DESCRIPTION OF SYMBOLS
101 ... Recognition target input unit
102 ... Web page acquisition unit
103 ... Web page analysis unit
104 ... Analysis parameter storage unit
105 ... Extracted text processing unit
106 ... Speech recognition parameter control unit
107 ... Content acquisition unit
108 ... Content analysis unit
109 ... Content separation unit
110 ... Audio input unit
111 ... Speech recognition unit
112 ... Recognition result output unit
206 ... Speech recognition parameter control unit
209 ... Content separation unit
213 ... Video input unit
214 ... Image recognition unit
306 ... Speech recognition parameter control unit
309 ... Content separation unit
315 ... Separated text input unit
316 ... Separated text processing unit
406 ... Speech recognition parameter control unit
409 ... Content separation unit
501 ... Recognition target input unit
502 ... Web page acquisition unit
509 ... Content separation unit
510 ... Audio input unit
511 ... Second speech recognition unit
513 ... Video input unit
514 ... Image recognition unit
515 ... Separated text input unit
516 ... Separated text processing unit
517 ... First speech recognition unit
601 ... Recognition target input unit
602 ... Web page acquisition unit

Claims

A speech recognition apparatus comprising:
a content acquisition unit that acquires content including audio data;
a Web page acquisition unit that acquires a Web page providing the content;
a Web page analysis unit that performs analysis based on the Web page providing the content and extracts text indicating characteristics of the audio data;
a parameter control unit that controls speech recognition parameters for the audio data based on the extracted text; and
a speech recognition unit that performs speech recognition on the audio data in accordance with the controlled speech recognition parameters.

The speech recognition apparatus according to claim 1, further comprising a storage unit that stores an identifier of a Web page in association with an analysis target and an extraction method for extracting the text,
wherein the Web page analysis unit extracts the text from the analysis target corresponding to the identifier of the Web page providing the content, according to the extraction method corresponding to that identifier.

The speech recognition apparatus according to claim 1, further comprising a storage unit that stores an identifier of a Web page, a narrowing condition, and an application target of the narrowing condition in association with each other,
wherein the Web page analysis unit omits extraction of the text if the narrowing condition corresponding to the identifier of the Web page providing the content is not satisfied in the application target of that narrowing condition.

The speech recognition apparatus according to claim 2, wherein the storage unit further stores, in association with the identifier of the Web page, a processing method for converting the extracted text into a control parameter, and
the control unit controls the speech recognition parameters according to a control parameter obtained by converting the extracted text according to the processing method corresponding to the identifier of the Web page providing the content.

The speech recognition apparatus according to claim 1, wherein, if the Web page providing the content is a predetermined Web page, the Web page analysis unit searches for another Web page using the extracted text and extracts the text from a predetermined analysis target corresponding to the identifier of the other Web page according to a predetermined extraction method.

The speech recognition apparatus according to claim 1, wherein the parameter control unit controls the speech recognition parameter based on a description position of the text.

The speech recognition apparatus according to claim 1, wherein, if the text includes a predetermined keyword, the parameter control unit controls the speech recognition parameters based on the predetermined keyword.

A speech recognition apparatus comprising:
a content acquisition unit that acquires content including audio data;
a Web page acquisition unit that acquires a Web page related to the content based on at least one of a speech recognition result of the audio data, an image recognition result of video data separated from the content, and text data separated from the content;
a Web page analysis unit that performs analysis based on the Web page related to the content and extracts text indicating characteristics of the audio data;
a parameter control unit that controls speech recognition parameters for the audio data based on the extracted text; and
a speech recognition unit that performs speech recognition on the audio data in accordance with the controlled speech recognition parameters.

The speech recognition apparatus according to claim 8, wherein the Web page acquisition unit weights at least one of a first word included in the speech recognition result, a second word included in the image recognition result, and a third word included in the text data, and generates a search expression for searching for a Web page related to the content.

The speech recognition apparatus according to claim 9, wherein the Web page acquisition unit performs weighting on the second word based on a character size, a shape, or an in-screen position.

The speech recognition apparatus according to claim 8, wherein, if the image recognition result matches a predetermined mark, the Web page acquisition unit acquires a Web page related to the content according to a predetermined identifier corresponding to the predetermined mark.