JP7009338B2

JP7009338B2 - Information processing equipment, information processing systems, and video equipment

Info

Publication number: JP7009338B2
Application number: JP2018175656A
Authority: JP
Inventors: 哲尾崎
Original assignee: TVS Regza Corp
Current assignee: TVS Regza Corp
Priority date: 2018-09-20
Filing date: 2018-09-20
Publication date: 2022-01-25
Anticipated expiration: 2038-09-20
Also published as: CN112236816A; JP2020046564A; WO2020057467A1; CN112236816B

Description

Embodiments of the present invention relate to an information processing device, an information processing system, and a video device.

Techniques for converting spoken audio into character data or the like by voice recognition are conventionally known.

Also known are voice assistant service techniques that operate AV equipment or the like based on the results of such voice recognition.

Japanese Unexamined Patent Publication No. 2004-029317

However, in the prior art, when a general-purpose voice assistant service is used, it is sometimes difficult to recognize program-related information by voice with high accuracy.

The information processing device of the embodiment includes an acquisition unit, a determination unit, and a replacement unit. The acquisition unit acquires first voice recognition data, obtained when a first voice recognition device performs voice recognition on a spoken utterance, together with a syntax analysis result of the first voice recognition data. The determination unit determines, based on the syntax analysis result, whether the first voice recognition data includes first program information relating to a program. When the determination unit determines that the first voice recognition data includes the first program information, the replacement unit acquires second program information included in second voice recognition data, which is obtained when a second voice recognition device having a dictionary in which program-related information is registered performs voice recognition on the same utterance, and replaces the first program information included in the first voice recognition data with the second program information.

FIG. 1 is a diagram showing an example of the overall configuration of the information processing system according to the embodiment.
FIG. 2 is a diagram showing an example of the hardware configuration of the television device according to the embodiment.
FIG. 3 is a diagram showing an example of the functions of the television device according to the embodiment.
FIG. 4 is a diagram showing an example of the audio information transmitted from the television device according to the embodiment.
FIG. 5 is a diagram showing an example of the functions of the program information recognition server according to the embodiment.
FIG. 6 is a diagram showing an example of the dictionary according to the embodiment.
FIG. 7 is a diagram showing an example of the voice recognition result according to the embodiment.
FIG. 8 is a diagram showing an example of the functions of the storage server according to the embodiment.
FIG. 9 is a diagram showing an example of the functions of the intention determination server according to the embodiment.
FIG. 10 is a sequence diagram showing an example of the flow of the voice recognition processing according to the embodiment.
FIG. 11 is a diagram showing an example of the dictionary according to Modification 1.
FIG. 12 is a diagram showing an example of the overall configuration of the information processing system according to Modification 2.

FIG. 1 is a diagram showing an example of the overall configuration of the information processing system S1 according to the present embodiment. As shown in FIG. 1, the information processing system S1 includes a television device 10, a program information recognition server 20, an intention determination server 30, and a storage server 40.

The devices included in the information processing system S1 are connected to one another by a network such as the Internet. The television device 10 and the intention determination server 30 are also connected to a voice assistant server 50 by the network. The information processing system S1 may itself include the voice assistant server 50.

The television device 10 includes an audio input device such as a microphone and receives the voice spoken by the user. The television device 10 transmits the input voice as an audio signal to both the voice assistant server 50 and the program information recognition server 20. The television device 10 also receives the voice recognition result transmitted from the program information recognition server 20, described later, and forwards the received voice recognition result to the storage server 40. In addition, the television device 10 operates according to an instruction signal received from the intention determination server 30, described later. The television device 10 is an example of the video device in the present embodiment.

The voice assistant server 50 is a device that runs a general-purpose voice assistant service. For example, the voice assistant server 50 performs voice recognition on the audio signal received from the television device 10 and, based on the recognition result, performs Internet searches, controls various home appliances, and so on. The voice assistant server 50 also converts the audio signal into character data by voice recognition and performs syntax analysis on that character data.

In the present embodiment, the character data obtained when the voice assistant server 50 converts the audio signal into characters by voice recognition is referred to as the first voice recognition data. The information that identifies the object or the predicate contained in the first voice recognition data is referred to as the syntax analysis result. The first voice recognition data and the syntax analysis result are described in detail later.

The voice assistant server 50 transmits the first voice recognition data and the syntax analysis result for the first voice recognition data to the intention determination server 30. The voice assistant server 50 is an example of the first voice recognition device and of the other voice recognition device in the present embodiment.

The program information recognition server 20 is a device that stores a dictionary in which information (hereinafter, program information) relating to program content (hereinafter, a program) is registered, and that performs voice recognition on the audio signal received from the television device 10 based on that dictionary. The program information recognition server 20 transmits the second voice recognition data and the program information identification result to the television device 10. The program information recognition server 20 is an example of the second voice recognition device and of the voice recognition device in the present embodiment.

The program information is information relating to a program and includes information on at least one of the program title, the program genre, and the names of the program's performers. In the present embodiment, the program information is, by way of example, the program title.

The second voice recognition data is the result of voice recognition by the program information recognition server 20. More specifically, it is the character data obtained when the program information recognition server 20 converts the audio signal into characters based on the dictionary in which the program information is registered. The program information identification result is information that identifies the portion of the second voice recognition data that corresponds to program information. The program information included in the second voice recognition data is referred to as the second program information.

The storage server 40 acquires, via the television device 10, the second voice recognition data produced by the program information recognition server 20 and the program information identification result, and stores this information. The storage server 40 is an example of the storage device and of the external device in the present embodiment.

The intention determination server 30 determines whether the spoken utterance is a voice command intended to operate the television device 10. Specifically, the intention determination server 30 acquires the first voice recognition data and the syntax analysis result from the voice assistant server 50 and determines whether the first voice recognition data includes program information. When the intention determination server 30 determines that the first voice recognition data includes program information, it determines that the spoken utterance is a voice command intended to operate the television device 10. The program information included in the first voice recognition data is referred to as the first program information.

When the intention determination server 30 determines that the first voice recognition data includes program information, it acquires the second program information from the storage server 40 and replaces the first program information included in the first voice recognition data with the second program information. The voice recognition data resulting from this replacement is referred to as the third voice recognition data.
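The replacement that yields the third voice recognition data can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the syntax analysis result marks the object as a pair of character offsets in the first voice recognition data; the names `ParseResult` and `replace_program_info` and the sample title are hypothetical and do not appear in the embodiment.

```python
from dataclasses import dataclass


@dataclass
class ParseResult:
    """Hypothetical syntax analysis result: character offsets of the object
    (the presumed first program information) in the first recognition text."""
    object_start: int
    object_end: int


def replace_program_info(first_text: str, parse: ParseResult,
                         second_program_info: str) -> str:
    """Return the third voice recognition data: the first recognition text
    with the first program information swapped for the second."""
    return (first_text[:parse.object_start]
            + second_program_info
            + first_text[parse.object_end:])


# The general-purpose recognizer heard "neko log"; the dictionary-based
# recognizer returned the registered (hypothetical) title.
third_text = replace_program_info("play neko log", ParseResult(5, 13), "Neko★Log")
```

Only the span identified as program information is replaced, so the command portion recognized by the general-purpose service is kept unchanged.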

The intention determination server 30 also controls the television device 10 by transmitting an instruction signal to it based on the voice recognition data after the replacement (the third voice recognition data). The instruction signal is a signal that instructs the television device 10 to perform an operation and includes, for example, a command specifying a program to be recorded or a program to be played back. The intention determination server 30 is an example of the information processing device in the present embodiment.

The program information recognition server 20, the intention determination server 30, the storage server 40, and the voice assistant server 50 of the present embodiment each include a control device such as a CPU, storage devices such as a ROM (Read Only Memory) and a RAM, and external storage devices such as an HDD and a CD drive device, and have a hardware configuration based on an ordinary computer. These servers may also be built, for example, in a cloud environment on a network.

The information processing system S1 may include a plurality of television devices 10. In this case, the program information recognition server 20, the intention determination server 30, the storage server 40, and the voice assistant server 50 are connected to the plurality of television devices 10 and exchange information with them.

Next, the details of each device included in the information processing system S1 of the present embodiment will be described.
FIG. 2 is a diagram showing an example of the hardware configuration of the television device 10 according to the present embodiment. As shown in FIG. 2, the television device 10 includes an antenna 101, an input terminal 102a, a tuner 103, a demodulator 104, a demultiplexer 105, input terminals 102b and 102c, an A/D (analog/digital) converter 106, a selector 107, a signal processing unit 108, a speaker 109, a display panel 110, an operation unit 111, a light receiving unit 112, an IP communication unit 113, a CPU 114, a memory 115, a storage 116, a microphone 117, and an audio I/F (interface) 118.

The antenna 101 receives a digital broadcast signal and supplies the received broadcast signal to the tuner 103 via the input terminal 102a. The tuner 103 selects the broadcast signal of a desired channel from the broadcast signals supplied from the antenna 101 and supplies the selected broadcast signal to the demodulator 104.

The demodulator 104 demodulates the broadcast signal supplied from the tuner 103 and supplies the demodulated broadcast signal to the demultiplexer 105. The demultiplexer 105 separates the broadcast signal supplied from the demodulator 104 to generate a video signal and an audio signal, and supplies them to the selector 107, described later.

The input terminal 102b accepts analog signals (a video signal and an audio signal) input from outside. The input terminal 102c is configured to accept digital signals (a video signal and an audio signal) input from outside. For example, the input terminal 102c can receive digital signals from a recorder (BD recorder) or similar device equipped with a drive that records to and plays back from a recording medium such as a Blu-ray disc. The A/D converter 106 applies A/D conversion to the analog signal supplied from the input terminal 102b and supplies the resulting digital signal to the selector 107.

The operation unit 111 accepts operation input from the user. The light receiving unit 112 receives infrared light from a remote controller 119. The IP communication unit 113 is a communication interface for performing IP (Internet Protocol) communication via a network 300. The IP communication unit 113 can communicate with the program information recognition server 20, the intention determination server 30, the storage server 40, and the voice assistant server 50 via the network 300.

The CPU 114 is a control unit that controls the television device 10 as a whole. The memory 115 comprises a ROM that stores the various computer programs executed by the CPU 114, a RAM that provides a work area for the CPU 114, and the like. The storage 116 is an HDD (hard disk drive), an SSD (solid state drive), or the like. The storage 116 records, for example, the signal selected by the selector 107 as recorded data.

The microphone 117 picks up the voice spoken by the user and sends it to the audio I/F 118. The audio I/F 118 performs analog-to-digital conversion on the voice picked up by the microphone 117 and sends it to the CPU 114 as an audio signal.

Next, the functions of the television device 10 according to the present embodiment will be described.
FIG. 3 is a diagram showing an example of the functions of the television device 10 according to the present embodiment. As shown in FIG. 3, the television device 10 includes an audio input unit 11, a wake word determination unit 12, a first transmission unit 13, a first reception unit 14, a second transmission unit 15, a second reception unit 16, a playback unit 17, and a recording unit 18.

The audio input unit 11 inputs (acquires) the voice spoken by the user as an audio signal. More specifically, the audio input unit 11 receives from the audio I/F 118 the audio signal obtained by digitally converting the voice spoken by the user. The audio input unit 11 sends the acquired audio signal (voice) to the wake word determination unit 12.

The wake word determination unit 12 determines whether the audio signal acquired by the audio input unit 11 contains a predetermined wake word. The wake word is a predetermined voice command that triggers activation of the voice assistant function and is also called an invocation word. The wake word is defined in advance. A known voice recognition technique can be used to determine whether the audio signal contains the wake word. When the wake word determination unit 12 determines that the acquired audio signal contains the predetermined wake word, it sends the portion of the audio signal that follows the wake word to the first transmission unit 13.
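As a rough illustration of this gating step, the following Python sketch works on a text transcript as a stand-in for the audio signal, which is a simplifying assumption; the embodiment operates on audio and leaves the detection technique to known methods. The wake word value and the function name `signal_after_wake_word` are hypothetical.

```python
from typing import Optional

WAKE_WORD = "hello tv"  # hypothetical wake word, defined in advance


def signal_after_wake_word(transcript: str,
                           wake_word: str = WAKE_WORD) -> Optional[str]:
    """Return the part of the input that follows the wake word,
    or None when the wake word is absent (nothing is forwarded)."""
    index = transcript.lower().find(wake_word.lower())
    if index < 0:
        return None
    return transcript[index + len(wake_word):].strip()


forwarded = signal_after_wake_word("Hello TV record the news")
ignored = signal_after_wake_word("record the news")
```

Input without the wake word yields `None`, which corresponds to the case where nothing is sent to the first transmission unit 13.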

The first transmission unit 13 transmits audio information to the program information recognition server 20 and the voice assistant server 50. The audio information associates the audio signal that follows the predetermined wake word with identification information that can identify the television device 10 and identification information that can identify the audio signal.

FIG. 4 is a diagram showing an example of the audio information 81 transmitted from the television device 10 according to the present embodiment. As shown in FIG. 4, the audio information 81 is information in which identification information that can identify the television device 10 (television device ID), identification information that can identify the audio signal (voice ID), and the audio signal are associated with one another.
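The association of the three items in the audio information 81 can be pictured as a simple record. The sketch below is illustrative only: the field names, the use of a random UUID as the voice ID, and the helper `build_audio_info` are assumptions, since the embodiment does not specify how the IDs are generated or encoded.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioInfo:
    """Illustrative counterpart of the audio information 81."""
    television_device_id: str  # identifies the television device 10
    voice_id: str              # identifies this particular audio signal
    audio_signal: bytes        # audio that follows the wake word


def build_audio_info(television_device_id: str, audio_signal: bytes) -> AudioInfo:
    # Assumption: a random UUID is unique enough to serve as the voice ID.
    return AudioInfo(television_device_id, uuid.uuid4().hex, audio_signal)


info = build_audio_info("tv-001", b"\x00\x01")
```

Because the same record is sent to both servers, the pair of television device ID and voice ID can later be used to match the two recognition results for one utterance.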

Returning to FIG. 3, the first reception unit 14 receives the television device ID, the voice ID, the second voice recognition data, and the program information identification result from the program information recognition server 20. The first reception unit 14 sends the received information to the second transmission unit 15.

The second transmission unit 15 transmits the television device ID, the voice ID, the second voice recognition data, and the program information identification result received by the first reception unit 14 to the storage server 40. Alternatively, the second transmission unit 15 may identify the second program information included in the second voice recognition data based on the program information identification result, and transmit the television device ID, the voice ID, and the second program information to the storage server 40.

The second reception unit 16 receives from the intention determination server 30 an instruction signal that instructs an operation relating to the program information identified by the program information recognition server 20. The second reception unit 16 sends the instruction signal received from the intention determination server 30 to the playback unit 17 and the recording unit 18.

The playback unit 17 plays back recorded data of a program stored in the storage 116 or an external storage device, based on the instruction signal received by the second reception unit 16. For example, the playback unit 17 searches the storage 116 or the external storage device for the recorded data of the program title specified by the instruction signal and plays back that recorded data. When the instruction signal instructs display of a program currently being broadcast rather than recorded data, the playback unit 17 may control the tuner 103 to tune to the channel on which the program specified by the instruction signal is being broadcast, and display that program on the display panel 110.

The recording unit 18 controls the selector 107 based on the instruction signal received by the second reception unit 16 to select the program to be recorded, and saves (records) that program in the storage 116 or an external storage device.

Next, the functions of the program information recognition server 20 of the present embodiment will be described.
FIG. 5 is a diagram showing an example of the functions of the program information recognition server 20 according to the present embodiment. As shown in FIG. 5, the program information recognition server 20 includes a reception unit 21, a specifying unit 22, an output unit 23, and a storage unit 25.

The storage unit 25 stores in advance a dictionary in which program information is registered. The storage unit 25 is a storage device such as an HDD.

FIG. 6 is a diagram showing an example of the dictionary 80 according to the present embodiment. As shown in FIG. 6, the dictionary 80 registers the character data of each program title in association with information indicating the pronunciation of that title (its reading in kana). Because program titles may contain symbols, ateji (characters used for their sound rather than their meaning), and the like, the pronunciation of a title may differ from the ordinary reading of its characters; the correct pronunciation of each program title is therefore registered in the dictionary 80 in advance. Identification information such as an ID that can identify the program may be registered in the dictionary 80 instead of the character data of the title. Various metadata about the program, such as its broadcast time, may also be registered in the dictionary 80.

Returning to FIG. 5, the reception unit 21 receives the audio signal from the television device 10. More specifically, the reception unit 21 receives from the television device 10 the audio information 81, in which the audio signal, the identification information that can identify the television device 10 (television device ID), and the identification information that can identify the audio signal (voice ID) are associated with one another. The reception unit 21 sends the received audio information 81 to the specifying unit 22.

The specifying unit 22 generates the second voice recognition data by voice recognition using the dictionary 80. Specifically, the specifying unit 22 converts the audio signal included in the audio information 81 received by the reception unit 21 into character data by voice recognition. The specifying unit 22 also identifies the program title contained in that character data based on the dictionary 80 stored in the storage unit 25. For example, when the second voice recognition data contains a portion that matches the pronunciation of a program title registered in the dictionary 80, the specifying unit 22 identifies that portion as a program title.

FIG. 7 is a diagram showing an example of the voice recognition result according to the present embodiment. As shown in FIG. 7, when a user 9 inputs a voice 90 to the television device 10, the television device 10 transmits the voice 90 as an audio signal to the voice assistant server 50 and the program information recognition server 20. When generating the second voice recognition data 92, the specifying unit 22 of the program information recognition server 20 converts the portion of the character data identified as a program title into the character data of the program title registered in the dictionary 80. As described above, the dictionary 80 registers the character data of each program title in association with information indicating its pronunciation, so the specifying unit 22 can identify the program title included in the second voice recognition data 92 with high accuracy even when the title is read differently from the ordinary reading of its characters. The method by which the specifying unit 22 performs voice recognition and identifies program information is not limited to this, and other known methods may be used.
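The dictionary-based identification can be sketched as follows in Python. This is a simplified stand-in, not the method of the embodiment: it assumes the recognizer has already produced a reading-level text, uses exact substring matching with the longest reading tried first, and uses hypothetical dictionary entries and a hypothetical function name `identify_title`.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the dictionary 80: reading -> registered title.
DICTIONARY: Dict[str, str] = {
    "neko log": "Neko★Log",
    "neko": "NEKO",
}


def identify_title(recognized: str,
                   dictionary: Dict[str, str] = DICTIONARY
                   ) -> Tuple[str, Optional[Tuple[int, int]]]:
    """Replace the first matching reading with its registered title.

    Returns the converted text and the (start, end) span of the title in
    that text, or the unchanged text and None when nothing matches.
    """
    # Try longer readings first so "neko log" wins over "neko".
    for reading in sorted(dictionary, key=len, reverse=True):
        start = recognized.find(reading)
        if start >= 0:
            title = dictionary[reading]
            text = recognized[:start] + title + recognized[start + len(reading):]
            return text, (start, start + len(title))
    return recognized, None


second_text, title_span = identify_title("play neko log")
```

Here the returned text plays the role of the second voice recognition data 92 and the span plays the role of the program information identification result.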

The specifying unit 22 also associates the second voice recognition data 92 with information identifying the portion of the second voice recognition data 92 that corresponds to program information (the program information identification result), and sends them to the output unit 23.

Returning to FIG. 5, the output unit 23 outputs the second program information identified by the specifying unit 22. More specifically, the output unit 23 associates the television device ID, the voice ID, the second voice recognition data 92, and the program information identification result with one another, and outputs them to the television device 10.

Next, the functions of the storage server 40 of the present embodiment will be described.
FIG. 8 is a diagram showing an example of the functions of the storage server 40 according to the present embodiment. As shown in FIG. 8, the storage server 40 includes a storage processing unit 41, a search unit 42, and a storage unit 45.

The storage processing unit 41 receives the television device ID, the voice ID, the second voice recognition data 92, and the program information identification result output from the television device 10, and saves them in the storage unit 45.

記憶部４５は、テレビジョン装置１０から出力されたテレビジョン装置ＩＤと、音声ＩＤと、第２の音声認識データ９２と、番組情報の特定結果とを対応付けて記憶する。記憶部４５は、例えばＨＤＤ等の記憶装置である。 The storage unit 45 stores the television device ID, the voice ID, the second voice recognition data 92, and the identification result of the program information output from the television device 10 in association with one another. The storage unit 45 is, for example, a storage device such as an HDD.

検索部４２は、意図判断サーバ３０から第２の音声認識データ９２に含まれる第２の番組情報の送信要求を受けた場合に、意図判断サーバ３０から送信されたテレビジョン装置ＩＤと、音声ＩＤとに対応付けられた第２の音声認識データ９２を、記憶部４５から検索し、意図判断サーバ３０に対して送信する。なお、検索部４２は、第２の音声認識データ９２全体ではなく、第２の音声認識データ９２のうちの第２の番組情報に該当する箇所を意図判断サーバ３０に送信しても良い。 When the search unit 42 receives, from the intention determination server 30, a request to transmit the second program information included in the second voice recognition data 92, it searches the storage unit 45 for the second voice recognition data 92 associated with the television device ID and the voice ID sent by the intention determination server 30, and transmits the data found to the intention determination server 30. Instead of the entire second voice recognition data 92, the search unit 42 may transmit only the portion of the second voice recognition data 92 that corresponds to the second program information.
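
The save and search behavior of the storage server 40 can be illustrated with a small in-memory sketch. It assumes that records are keyed by the pair of television device ID and voice ID and that the identification result is a character span. The class name `StorageServer`, its method names, and the sample IDs are hypothetical and not taken from the embodiment.

```python
from typing import Dict, Optional, Tuple


class StorageServer:
    """Hypothetical in-memory stand-in for the storage server 40."""

    def __init__(self) -> None:
        # Plays the role of storage unit 45: (tv_id, voice_id) -> stored record.
        self._records: Dict[Tuple[str, str], dict] = {}

    def save(self, tv_id: str, voice_id: str, text: str,
             title_span: Tuple[int, int]) -> None:
        # Storage processing unit 41: keep the result in association with both IDs.
        self._records[(tv_id, voice_id)] = {"text": text, "title_span": title_span}

    def search(self, tv_id: str, voice_id: str,
               title_only: bool = False) -> Optional[str]:
        # Search unit 42: look up the record for the IDs sent by the requester.
        record = self._records.get((tv_id, voice_id))
        if record is None:
            return None
        if title_only:
            start, end = record["title_span"]
            return record["text"][start:end]
        return record["text"]


title = "楽しい☆トーーーク！！"
storage = StorageServer()
storage.save("tv-001", "voice-0001", title + "を再生して", (0, len(title)))
```

The `title_only` flag corresponds to the option of returning only the portion that is the second program information rather than the whole second voice recognition data.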

次に、本実施形態の意図判断サーバ３０の機能について説明する。 Next, the function of the intention determination server 30 of this embodiment will be described.
図９は、本実施形態にかかる意図判断サーバ３０が有する機能の一例を示す図である。図９に示すように、意図判断サーバ３０は、取得部３１と、判断部３２と、置換部３３と、映像装置制御部３４と、記憶部３５とを備える。 FIG. 9 is a diagram showing an example of the functions of the intention determination server 30 according to the present embodiment. As shown in FIG. 9, the intention determination server 30 includes an acquisition unit 31, a determination unit 32, a replacement unit 33, a video device control unit 34, and a storage unit 35.

記憶部３５には、テレビジョン装置１０に対する命令に使用される所定のコマンドが予め保存される。所定のコマンドは、テレビジョン装置１０の動作を指定する命令であり、例えば、「再生して」、「録画して」、「つけて」等であるが、これらに限定されるものではない。また、記憶部３５は、例えばＨＤＤ等の記憶装置である。 Predetermined commands used to instruct the television device 10 are stored in the storage unit 35 in advance. A predetermined command is an instruction that specifies an operation of the television device 10, for example "play", "record", or "turn on", but the commands are not limited to these. The storage unit 35 is, for example, a storage device such as an HDD.

取得部３１は、音声アシスタントサーバ５０から出力された第１の音声認識データと、構文解析結果と、テレビジョン装置ＩＤと、音声ＩＤとを取得する。 The acquisition unit 31 acquires the first voice recognition data, the syntax analysis result, the television device ID, and the voice ID output from the voice assistant server 50.

ここで、第１の音声認識データについて説明する。 Here, the first voice recognition data will be described.
上述の図７に示すように、テレビジョン装置１０は、ユーザ９が発した音声９０を音声信号として音声アシスタントサーバ５０に送信する。音声アシスタントサーバ５０は、受信した音声信号を音声認識によって文字データに変換することにより、第１の音声認識データ９１を生成する。また、音声アシスタントサーバ５０は、生成した第１の音声認識データ９１の構文解析を行う。 As shown in FIG. 7 above, the television device 10 transmits the voice 90 uttered by the user 9 to the voice assistant server 50 as a voice signal. The voice assistant server 50 generates the first voice recognition data 91 by converting the received voice signal into character data through voice recognition. The voice assistant server 50 also performs syntax analysis on the generated first voice recognition data 91.

構文解析結果は、第１の音声認識データ９１に含まれる目的語または述語を特定する情報とする。例えば、音声アシスタントサーバ５０は、第１の音声認識データ９１に含まれる文章のうち、目的語に該当する文字の範囲と、動詞等を含む述語に該当する文字の範囲とを構文解析によって特定する。構文解析の手法は、公知の手法を採用することができる。音声アシスタントサーバ５０は、テレビジョン装置ＩＤと、音声ＩＤと、生成した第１の音声認識データ９１と構文解析結果とを対応付けて、意図判断サーバ３０に送信する。 The syntax analysis result is information that identifies an object or a predicate included in the first voice recognition data 91. For example, the voice assistant server 50 identifies, by syntax analysis, the range of characters corresponding to the object and the range of characters corresponding to the predicate, which includes a verb or the like, in the sentence contained in the first voice recognition data 91. A known method can be adopted for the syntax analysis. The voice assistant server 50 associates the television device ID, the voice ID, the generated first voice recognition data 91, and the syntax analysis result with one another, and transmits them to the intention determination server 30.

図９に戻り、判断部３２は、構文解析結果に基づいて、第１の音声認識データ９１が番組情報を含むか否かを判断する。本実施形態においては、判断部３２は、第１の音声認識データ９１が番組タイトルを含むか否かを判断する。例えば、判断部３２は、第１の音声認識データ９１のうち構文解析によって「述語」と特定された箇所が、記憶部３５に保存された所定のコマンドのいずれかを含む場合に、第１の音声認識データ９１のうち「目的語」と特定された箇所が番組タイトルであると判断する。なお、第１の音声認識データ９１が番組情報を含むか否かの判断の手法はこれに限定されるものではなく、他の公知の解析手法を採用可能である。 Returning to FIG. 9, the determination unit 32 determines, based on the syntax analysis result, whether or not the first voice recognition data 91 includes program information. In the present embodiment, the determination unit 32 determines whether or not the first voice recognition data 91 includes a program title. For example, when the portion of the first voice recognition data 91 identified as the "predicate" by the syntax analysis includes any of the predetermined commands stored in the storage unit 35, the determination unit 32 determines that the portion of the first voice recognition data 91 identified as the "object" is a program title. The method for determining whether or not the first voice recognition data 91 includes program information is not limited to this, and other known analysis methods can be adopted.
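
The predicate-based check made by the determination unit 32 can be sketched as follows. The sketch assumes that the syntax analysis result is delivered as two character ranges, one for the object and one for the predicate. The names `ParseResult` and `find_program_title`, the tuple representation of the ranges, and the weather query used as a negative example are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Predetermined commands held in storage unit 35 (the examples given in the text).
COMMANDS = ("再生して", "録画して", "つけて")


@dataclass
class ParseResult:
    object_span: Optional[Tuple[int, int]]     # character range of the object
    predicate_span: Optional[Tuple[int, int]]  # character range of the predicate


def find_program_title(text: str, parse: ParseResult) -> Optional[str]:
    """Return the object text if the predicate contains a command, else None."""
    if parse.object_span is None or parse.predicate_span is None:
        return None
    p_start, p_end = parse.predicate_span
    predicate = text[p_start:p_end]
    if not any(cmd in predicate for cmd in COMMANDS):
        return None
    o_start, o_end = parse.object_span
    return text[o_start:o_end]


first_text = "楽しいトークを再生して"
parse = ParseResult(object_span=(0, 6), predicate_span=(7, 11))
title_candidate = find_program_title(first_text, parse)
```

Because the check only looks at the predicate, it does not need to know any program titles itself, which is why the dictionary can stay on the program information recognition server.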

置換部３３は、判断部３２によって第１の音声認識データ９１が番組情報（第１の番組情報）を含むと判断された場合に、第２の音声認識データに含まれる第２の番組情報を記憶サーバ４０から取得し、第１の音声認識データ９１に含まれる第１の番組情報を、第２の番組情報に置換する。例えば、図７に示した例では、置換部３３は、第１の音声認識データ９１の「楽しいトークを再生して」の「楽しいトーク」を、第２の音声認識データ９２の「楽しい☆トーーーク！！」に置換する。置換部３３は、当該置換処理後の第３の音声認識データを、映像装置制御部３４に送出する。 When the determination unit 32 determines that the first voice recognition data 91 includes program information (first program information), the replacement unit 33 acquires the second program information included in the second voice recognition data from the storage server 40 and replaces the first program information included in the first voice recognition data 91 with the second program information. In the example shown in FIG. 7, the replacement unit 33 replaces "楽しいトーク" ("fun talk") in the first voice recognition data 91, "楽しいトークを再生して" ("play fun talk"), with "楽しい☆トーーーク！！" from the second voice recognition data 92. The replacement unit 33 sends the resulting third voice recognition data to the video device control unit 34.
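
The replacement itself amounts to a span substitution, as the following minimal sketch shows using the strings from the FIG. 7 example. The function name `replace_program_info` is hypothetical, and locating the first program information with `str.find` is a simplification; in the embodiment the range comes from the syntax analysis result.

```python
from typing import Tuple


def replace_program_info(first_text: str, first_span: Tuple[int, int],
                         second_info: str) -> str:
    """Build the third recognition data by swapping in the second program information."""
    start, end = first_span
    return first_text[:start] + second_info + first_text[end:]


first_text = "楽しいトークを再生して"
first_info = "楽しいトーク"
start = first_text.find(first_info)  # simplification: span normally comes from parsing
third_text = replace_program_info(
    first_text, (start, start + len(first_info)), "楽しい☆トーーーク！！"
)
```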

映像装置制御部３４は、第３の音声認識データに基づいて、テレビジョン装置１０に対して指示信号を送信してテレビジョン装置１０の動作を制御する。例えば、映像装置制御部３４は、第３の音声認識データに含まれる番組タイトルと、コマンドとを信号に変換して、指示信号としてテレビジョン装置１０に送信する。 The video device control unit 34 controls the operation of the television device 10 by transmitting an instruction signal to the television device 10 based on the third voice recognition data. For example, the video apparatus control unit 34 converts the program title and the command included in the third voice recognition data into a signal and transmits the instruction signal to the television apparatus 10.
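
The embodiment does not specify the format of the instruction signal. Purely as an illustration, the sketch below assumes a UTF-8 JSON payload carrying an action name and the program title. The mapping `COMMAND_TO_ACTION`, the field names, and the function `build_instruction_signal` are all hypothetical.

```python
import json

# Hypothetical mapping from the spoken commands to action names in the signal.
COMMAND_TO_ACTION = {"再生して": "play", "録画して": "record", "つけて": "power_on"}


def build_instruction_signal(title: str, command: str) -> bytes:
    """Encode the program title and command as a JSON instruction signal (assumed format)."""
    payload = {"action": COMMAND_TO_ACTION[command], "title": title}
    # ensure_ascii=False keeps the title readable in the encoded payload.
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


signal = build_instruction_signal("楽しい☆トーーーク！！", "再生して")
decoded = json.loads(signal.decode("utf-8"))
```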

次に、本実施形態における音声認識処理の流れについて説明する。 Next, the flow of the voice recognition process in this embodiment will be described.
図１０は、本実施形態にかかる音声認識処理の流れの一例を示すシーケンス図である。テレビジョン装置１０の音声入力部１１は、ユーザ９が発話した音声９０の入力を受ける（Ｓ１）。音声入力部１１は、入力された音声９０をウェイクワード判断部１２に送出する。 FIG. 10 is a sequence diagram showing an example of the flow of the voice recognition process according to the present embodiment. The voice input unit 11 of the television device 10 receives the input of the voice 90 spoken by the user 9 (S1). The voice input unit 11 sends the input voice 90 to the wake word determination unit 12.

次に、ウェイクワード判断部１２は、入力された音声９０に所定のウェイクワードが含まれているか否かを判断する（Ｓ２）。ウェイクワード判断部１２は、音声入力部１１によって取得された音声９０が所定のウェイクワードを含むと判断した場合に、取得された音声９０のうち、所定のウェイクワードの後に続く音声を第１の送信部１３に送出する。また、ウェイクワード判断部１２は、音声９０が所定のウェイクワードを含まないと判断した場合は、第１の送信部１３に音声を送出しない。 Next, the wake word determination unit 12 determines whether or not the input voice 90 includes a predetermined wake word (S2). When the wake word determination unit 12 determines that the voice 90 acquired by the voice input unit 11 includes the predetermined wake word, it sends the portion of the acquired voice 90 that follows the wake word to the first transmission unit 13. When it determines that the voice 90 does not include the predetermined wake word, it does not send any voice to the first transmission unit 13.
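
The gating behavior of the wake word determination unit 12 can be sketched on text for illustration. An actual device would operate on the audio signal rather than on a transcript. The wake word "ねえテレビ" and the function name `speech_after_wake_word` are hypothetical.

```python
from typing import Optional

WAKE_WORD = "ねえテレビ"  # hypothetical wake word, not taken from the embodiment


def speech_after_wake_word(utterance: str) -> Optional[str]:
    """Return the part of the utterance after the wake word, or None if it is absent."""
    pos = utterance.find(WAKE_WORD)
    if pos < 0:
        # No wake word: nothing is passed on to the first transmission unit 13.
        return None
    # Drop the wake word and any separating comma or space that follows it.
    return utterance[pos + len(WAKE_WORD):].lstrip("、 ")


forwarded = speech_after_wake_word("ねえテレビ、楽しいトークを再生して")
ignored = speech_after_wake_word("楽しいトークを再生して")
```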

次に、第１の送信部１３は、入力された音声９０のうち所定のウェイクワードよりも後の音声を、音声信号として、番組情報認識サーバ２０と、音声アシスタントサーバ５０とに送信する。より詳細には、第１の送信部１３は、音声信号と、テレビジョン装置ＩＤと、音声ＩＤとを対応付けた音声情報８１を、番組情報認識サーバ２０と、音声アシスタントサーバ５０とに送信する（Ｓ３）。 Next, the first transmission unit 13 transmits the portion of the input voice 90 that follows the predetermined wake word, as a voice signal, to the program information recognition server 20 and the voice assistant server 50. More specifically, the first transmission unit 13 transmits voice information 81, in which the voice signal, the television device ID, and the voice ID are associated with one another, to the program information recognition server 20 and the voice assistant server 50 (S3).
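
Sending the same voice information 81 to both servers is what later allows the two recognition results to be matched by ID. The sketch below models this with two in-memory inboxes. The class `VoiceInfo`, the function `send_voice_info`, and the use of a random UUID as the voice ID are assumptions made for illustration; this passage does not state how the voice ID is generated.

```python
import uuid
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class VoiceInfo:
    """Hypothetical model of voice information 81."""
    audio: bytes
    tv_id: str
    voice_id: str


def send_voice_info(audio: bytes, tv_id: str,
                    destinations: List[Callable[[VoiceInfo], None]]) -> VoiceInfo:
    # Assumption: a fresh random ID identifies each utterance.
    info = VoiceInfo(audio=audio, tv_id=tv_id, voice_id=uuid.uuid4().hex)
    # The identical record goes to every destination so results can be joined later.
    for send in destinations:
        send(info)
    return info


program_server_inbox: List[VoiceInfo] = []
assistant_server_inbox: List[VoiceInfo] = []
sent = send_voice_info(
    b"\x00\x01", "tv-001",
    [program_server_inbox.append, assistant_server_inbox.append],
)
```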

そして、番組情報認識サーバ２０の受信部２１は、テレビジョン装置１０から音声情報８１を受信する。受信部２１は、受信した音声情報８１を、特定部２２に送出する。番組情報認識サーバ２０の特定部２２は、受信部２１が受信した音声情報８１に含まれる音声信号を音声認識によって文字データに変換し、当該文字データに含まれる番組タイトルを、辞書８０に基づいて特定する（Ｓ４）。 Then, the receiving unit 21 of the program information recognition server 20 receives the voice information 81 from the television device 10 and sends it to the specifying unit 22. The specifying unit 22 converts the voice signal included in the received voice information 81 into character data by voice recognition, and identifies the program title included in the character data based on the dictionary 80 (S4).

次に、番組情報認識サーバ２０の出力部２３は、テレビジョン装置ＩＤと、音声ＩＤと、第２の音声認識データ９２と、番組情報の特定結果とを対応付けて、音声認識結果としてテレビジョン装置１０に出力する（Ｓ５）。 Next, the output unit 23 of the program information recognition server 20 associates the television device ID, the voice ID, the second voice recognition data 92, and the identification result of the program information with one another, and outputs them to the television device 10 as the voice recognition result (S5).

そして、テレビジョン装置１０の第１の受信部１４は、音声信号の音声認識結果として、テレビジョン装置ＩＤと、音声ＩＤと、第２の音声認識データ９２と、番組情報の特定結果とを受信する。次に、テレビジョン装置１０の第２の送信部１５は、第１の受信部１４が受信した音声認識結果、つまり、テレビジョン装置ＩＤと、音声ＩＤと、第２の音声認識データ９２と、番組情報の特定結果とを、対応付けて記憶サーバ４０に送信する（Ｓ６）。 Then, the first receiving unit 14 of the television device 10 receives, as the voice recognition result for the voice signal, the television device ID, the voice ID, the second voice recognition data 92, and the identification result of the program information. Next, the second transmission unit 15 of the television device 10 transmits the voice recognition result received by the first receiving unit 14, that is, the television device ID, the voice ID, the second voice recognition data 92, and the identification result of the program information, in association with one another, to the storage server 40 (S6).

そして、記憶サーバ４０の保存処理部４１は、テレビジョン装置１０から受信した番組情報認識サーバ２０による音声認識結果を、記憶部４５に保存する（Ｓ７）。 Then, the storage processing unit 41 of the storage server 40 stores the voice recognition result by the program information recognition server 20 received from the television device 10 in the storage unit 45 (S7).

また、音声アシスタントサーバ５０は、Ｓ３の処理でテレビジョン装置１０から送信された音声情報８１に含まれる音声信号を音声認識して文字データに変換し、当該文字データの構文解析を行う（Ｓ８）。音声アシスタントサーバ５０は、テレビジョン装置ＩＤと、音声ＩＤと、第１の音声認識データ９１と、構文解析結果とを対応付けて、音声認識結果として意図判断サーバ３０に送信する（Ｓ９）。 Meanwhile, the voice assistant server 50 performs voice recognition on the voice signal included in the voice information 81 transmitted from the television device 10 in S3, converts it into character data, and performs syntax analysis on the character data (S8). The voice assistant server 50 associates the television device ID, the voice ID, the first voice recognition data 91, and the syntax analysis result with one another, and transmits them to the intention determination server 30 as the voice recognition result (S9).

そして、意図判断サーバ３０の取得部３１は、音声アシスタントサーバ５０から出力されたテレビジョン装置ＩＤと、音声ＩＤと、第１の音声認識データ９１と、構文解析結果とを取得する。そして、意図判断サーバ３０の判断部３２は、取得された構文解析結果に基づいて、第１の音声認識データ９１が番組タイトルを含むか否かを判断する（Ｓ１０）。 Then, the acquisition unit 31 of the intention determination server 30 acquires the television device ID, the voice ID, the first voice recognition data 91, and the parsing result output from the voice assistant server 50. Then, the determination unit 32 of the intention determination server 30 determines whether or not the first voice recognition data 91 includes the program title based on the acquired syntax analysis result (S10).

Ｓ１０の処理において判断部３２によって第１の音声認識データ９１が番組情報（第１の番組情報）を含むと判断された場合に、分岐Ｓ１００の処理が実行される。具体的には、意図判断サーバ３０の置換部３３は、判断部３２によって第１の音声認識データ９１が番組情報（第１の番組情報）を含むと判断された場合に、記憶サーバ４０に対して、第２の音声認識データ９２に含まれる第２の番組情報の送信要求をする（Ｓ１１）。より詳細には、置換部３３は、判断部３２によって第１の番組情報を含むと判断された第１の音声認識データ９１に対応付けられたテレビジョン装置ＩＤと、音声ＩＤとを、記憶サーバ４０に対して送信する。 When the determination unit 32 determines in S10 that the first voice recognition data 91 includes program information (first program information), the processing of branch S100 is executed. Specifically, in that case the replacement unit 33 of the intention determination server 30 requests the storage server 40 to transmit the second program information included in the second voice recognition data 92 (S11). More specifically, the replacement unit 33 transmits to the storage server 40 the television device ID and the voice ID associated with the first voice recognition data 91 that the determination unit 32 determined to include the first program information.

そして、記憶サーバ４０の検索部４２は、意図判断サーバ３０から送信されたテレビジョン装置ＩＤと、音声ＩＤとに対応付けられた第２の音声認識データ９２を、記憶部４５から検索し、意図判断サーバ３０に対して送信する（Ｓ１２）。なお、第２の音声認識データ９２の検索処理は、意図判断サーバ３０の置換部３３が実行するものとしても良い。また、置換部３３は、第２の番組情報を含む第２の音声認識データ９２全体を記憶サーバ４０から取得しても良いし、第２の音声認識データ９２のうちの第２の番組情報のみを取得しても良い。 Then, the search unit 42 of the storage server 40 searches the storage unit 45 for the second voice recognition data 92 associated with the television device ID and the voice ID transmitted from the intention determination server 30, and transmits it to the intention determination server 30 (S12). The search for the second voice recognition data 92 may instead be executed by the replacement unit 33 of the intention determination server 30. Further, the replacement unit 33 may acquire from the storage server 40 the entire second voice recognition data 92 including the second program information, or only the second program information in the second voice recognition data 92.

意図判断サーバ３０の置換部３３は、第１の音声認識データ９１に含まれる第１の番組情報を、記憶サーバ４０から取得した第２の番組情報に置換する（Ｓ１３）。 The replacement unit 33 of the intention determination server 30 replaces the first program information included in the first voice recognition data 91 with the second program information acquired from the storage server 40 (S13).

そして、意図判断サーバ３０の映像装置制御部３４は、置換部３３による置換処理が行われた第３の音声認識データに基づいて、テレビジョン装置１０に対して動作を指示する指示信号を送信する（Ｓ１４）。 Then, the video device control unit 34 of the intention determination server 30 transmits an instruction signal instructing the television device 10 to operate, based on the third voice recognition data produced by the replacement process of the replacement unit 33 (S14).

そして、テレビジョン装置１０の第２の受信部１６は、意図判断サーバ３０から受信した指示信号を、再生部１７または録画部１８に送出する。そして、再生部１７または録画部１８は、意図判断サーバ３０から送信された指示信号に従って、処理を実行する（Ｓ１５）。例えば、再生部１７または録画部１８は、指示信号によって指定された番組の録画データの再生や、指示信号によって指定された番組の録画等を実行する。 Then, the second receiving unit 16 of the television device 10 sends the instruction signal received from the intention determination server 30 to the reproducing unit 17 or the recording unit 18. Then, the reproduction unit 17 or the recording unit 18 executes the process according to the instruction signal transmitted from the intention determination server 30 (S15). For example, the reproduction unit 17 or the recording unit 18 executes reproduction of recorded data of a program designated by an instruction signal, recording of a program designated by an instruction signal, and the like.

なお、再生または録画の対象となる番組の候補が複数存在する場合、再生部１７または録画部１８は、表示パネル１１０に候補となる番組を選択可能に表示しても良い。例えば、指示信号によって指定されたタイトルの番組が、複数の放送回分録画済みである場合に、再生部１７は、録画済みの複数の放送回から再生対象を選択可能な選択画面を表示パネル１１０に表示しても良い。この場合、再生対象の放送回は、ユーザがリモートコントローラ１１９を操作することによって選択されるものとしても良いし、音声によって選択されるものとしても良い。 When there are a plurality of candidate programs to be reproduced or recorded, the reproduction unit 17 or the recording unit 18 may display the candidate programs on the display panel 110 so as to be selectable. For example, when a program with a title specified by an instruction signal has been recorded for a plurality of broadcast times, the playback unit 17 displays a selection screen on the display panel 110 on which a playback target can be selected from the plurality of recorded broadcast times. It may be displayed. In this case, the broadcast times to be reproduced may be selected by the user operating the remote controller 119, or may be selected by voice.

また、Ｓ１０において、意図判断サーバ３０の判断部３２によって第１の音声認識データ９１が第１の番組情報を含まないと判断された場合に、分岐Ｓ２００の処理が実行される。具体的には、意図判断サーバ３０の判断部３２は、第１の音声認識データ９１が第１の番組情報を含まないと判断したという判断結果を、音声アシスタントサーバ５０に送信する（Ｓ１５）。 Further, in S10, when the determination unit 32 of the intention determination server 30 determines that the first voice recognition data 91 does not include the first program information, the processing of the branch S200 is executed. Specifically, the determination unit 32 of the intention determination server 30 transmits the determination result that the first voice recognition data 91 does not include the first program information to the voice assistant server 50 (S15).

この場合、ユーザ９によって発話された音声９０は番組に関するものではないため、音声アシスタントサーバ５０は、第１の音声認識データ９１に基づいて、その他の音声アシスタント処理を開始する（Ｓ１６）。その他の音声アシスタント処理は、テレビジョン装置１０に対する操作以外の処理とする。音声アシスタントサーバ５０は、汎用的な音声アシスタントサービスを実行するため、その他の音声アシスタント処理の内容は特に限定するものではない。ここで、図１０のシーケンス図に示す処理は終了する。 In this case, since the voice 90 spoken by the user 9 is not related to the program, the voice assistant server 50 starts other voice assistant processing based on the first voice recognition data 91 (S16). Other audio assistant processing is processing other than the operation for the television device 10. Since the voice assistant server 50 executes a general-purpose voice assistant service, the contents of other voice assistant processing are not particularly limited. At this point, the process shown in the sequence diagram of FIG. 10 ends.
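
The two branches that follow S10 can be summarized in a short sketch. It assumes the syntax analysis result is given as character ranges and that the second program information is fetched through a callback. The function name `handle_first_result`, the returned labels, and the fallback used when no second program information is available are hypothetical; the embodiment does not describe that fallback case.

```python
from typing import Callable, Optional, Tuple

COMMANDS = ("再生して", "録画して", "つけて")  # examples of predetermined commands


def handle_first_result(
    first_text: str,
    object_span: Tuple[int, int],
    predicate_span: Tuple[int, int],
    fetch_second_info: Callable[[], Optional[str]],
) -> Tuple[str, str]:
    """Return (next action, text to act on) for one utterance."""
    predicate = first_text[predicate_span[0]:predicate_span[1]]
    if any(cmd in predicate for cmd in COMMANDS):
        # Branch S100: fetch the second program information and substitute it.
        second_info = fetch_second_info()
        if second_info is None:
            # Assumption: fall back to the unmodified first recognition data.
            return ("instruct_tv", first_text)
        start, end = object_span
        return ("instruct_tv", first_text[:start] + second_info + first_text[end:])
    # Branch S200: hand the utterance back for other voice assistant processing.
    return ("other_assistant_processing", first_text)


tv_case = handle_first_result(
    "楽しいトークを再生して", (0, 6), (7, 11), lambda: "楽しい☆トーーーク！！"
)
other_case = handle_first_result("今日の天気を教えて", (0, 5), (6, 9), lambda: None)
```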

このように、本実施形態の意図判断サーバ３０では、音声９０が音声アシスタントサーバ５０によって音声認識された第１の音声認識データ９１が第１の番組情報を含むと判断した場合に、第１の番組情報を、番組情報認識サーバ２０によって音声９０が音声認識された第２の音声認識データ９２に含まれる第２の番組情報に置換する。このため、本実施形態の意図判断サーバ３０によれば、汎用的な音声認識による音声認識結果を取得した上で、番組に関する情報については専用の辞書８０を用いた音声認識結果を採用することにより、番組に関する情報についての音声認識の精度を向上させることができる。 As described above, when the intention determination server 30 of the present embodiment determines that the first voice recognition data 91, obtained by the voice assistant server 50 performing voice recognition on the voice 90, includes the first program information, it replaces the first program information with the second program information included in the second voice recognition data 92, obtained by the program information recognition server 20 performing voice recognition on the voice 90. The intention determination server 30 thus acquires the result of general-purpose voice recognition and, for information about programs, adopts the result of voice recognition that uses the dedicated dictionary 80, which improves the accuracy of voice recognition for information about programs.

例えば、汎用的な音声アシスタントサービスのみを利用する場合、番組タイトル等の番組に関する情報を高精度に音声認識することが困難な場合がある。また、番組タイトル等の番組に関する情報専用の音声アシスタントサービスを使用する場合、番組に関する音声命令以外について高精度に音声認識することが困難な場合がある。これに対して、本実施形態の意図判断サーバ３０では、音声アシスタントサーバ５０と番組情報認識サーバ２０がそれぞれ音声認識した結果を取得し、番組に関する情報については番組情報認識サーバ２０の認識結果を採用する。これにより、本実施形態の意図判断サーバ３０によれば、汎用的な音声認識と、番組に関する情報の高精度な音声認識とを両立することができる。 For example, when only a general-purpose voice assistant service is used, it may be difficult to recognize information about programs, such as program titles, with high accuracy. Conversely, when a voice assistant service dedicated to information about programs is used, it may be difficult to recognize anything other than program-related voice commands with high accuracy. In contrast, the intention determination server 30 of the present embodiment acquires the results of voice recognition by both the voice assistant server 50 and the program information recognition server 20, and adopts the recognition result of the program information recognition server 20 for information about programs. As a result, the intention determination server 30 of the present embodiment can achieve both general-purpose voice recognition and highly accurate voice recognition of information about programs.

また、本実施形態の意図判断サーバ３０は、第１の番組情報が第２の番組情報に置換された第３の音声認識データに基づいて、テレビジョン装置１０の動作を制御する。これにより、本実施形態の意図判断サーバ３０によれば、番組に関する情報の高精度な音声認識結果に基づいて、テレビジョン装置１０の動作を制御することができる。 Further, the intention determination server 30 of the present embodiment controls the operation of the television device 10 based on the third voice recognition data in which the first program information is replaced with the second program information. Thereby, according to the intention determination server 30 of the present embodiment, the operation of the television apparatus 10 can be controlled based on the highly accurate voice recognition result of the information about the program.

また、本実施形態の第１の番組情報および第２の番組情報は、番組のタイトルと、番組のジャンルと、番組の出演者名と、のいずれかを含む。より具体的には、本実施形態の第１の番組情報および第２の番組情報は番組タイトルである。また、番組情報認識サーバ２０は、番組タイトルの発音が登録された辞書８０に基づいて、音声９０に含まれる番組のタイトルを第２の番組情報として特定する。このため、本実施形態の意図判断サーバ３０によれば、番組タイトルの読み方が一般的な読み方と異なっていた場合でも、番組情報認識サーバ２０によって高精度に音声認識された番組タイトルを取得することができる。 Further, the first program information and the second program information of the present embodiment include any one of a program title, a program genre, and a program performer name. More specifically, in the present embodiment the first program information and the second program information are program titles. The program information recognition server 20 identifies the program title included in the voice 90 as the second program information, based on the dictionary 80 in which the pronunciations of program titles are registered. Therefore, the intention determination server 30 of the present embodiment can acquire a program title that the program information recognition server 20 has recognized with high accuracy, even when the title is read differently from its ordinary reading.

また、本実施形態の情報処理システムＳ１は、テレビジョン装置１０と、番組情報認識サーバ２０と、意図判断サーバ３０と、記憶サーバ４０とを備える。本実施形態のテレビジョン装置１０は、入力された音声９０を、番組情報認識サーバ２０と、音声アシスタントサーバ５０と、に音声信号として送信する。また、本実施形態の番組情報認識サーバ２０は、辞書８０に登録された情報に基づいて、音声信号に含まれる第２の番組情報を特定し、特定した第２の番組情報を含む第２の音声認識データ９２を出力する。また、本実施形態の記憶サーバ４０は、番組情報認識サーバ２０によって特定された第２の番組情報を記憶する。また、本実施形態の意図判断サーバ３０は、音声アシスタントサーバ５０から取得した第１の音声認識データ９１が第１の番組情報を含むと判断した場合に、第１の音声認識データ９１に含まれる第１の番組情報を第２の番組情報に置換する。また、本実施形態の意図判断サーバ３０は、置換処理後の第３の音声認識データに基づいて、テレビジョン装置１０の動作を制御する。このように、本実施形態の情報処理システムＳ１では、ユーザ９が発話した音声９０に対して、汎用的な音声認識と、辞書８０を用いた音声認識との両方を実施する。このため、本実施形態の情報処理システムＳ１によれば、汎用的な音声認識による音声アシスタントサービスをユーザ９に提供すると共に、番組に関する情報を高精度に音声認識した結果に基づいてテレビジョン装置１０を制御することができる。つまり、本実施形態の情報処理システムＳ１によれば、音声アシスタントサービスをユーザ９に提供すると共に、番組に関する情報を高精度に音声認識した結果を音声アシスタントサービスに利用することができる。 Further, the information processing system S1 of the present embodiment includes a television device 10, a program information recognition server 20, an intention determination server 30, and a storage server 40. The television apparatus 10 of the present embodiment transmits the input voice 90 to the program information recognition server 20 and the voice assistant server 50 as voice signals. Further, the program information recognition server 20 of the present embodiment identifies the second program information included in the voice signal based on the information registered in the dictionary 80, and the second program information including the specified second program information is included. The voice recognition data 92 is output. Further, the storage server 40 of the present embodiment stores the second program information specified by the program information recognition server 20. Further, the intention determination server 30 of the present embodiment is included in the first voice recognition data 91 when it is determined that the first voice recognition data 91 acquired from the voice assistant server 50 includes the first program information. The first program information is replaced with the second program information. 
Further, the intention determination server 30 of the present embodiment controls the operation of the television device 10 based on the third voice recognition data after the replacement process. As described above, the information processing system S1 of the present embodiment performs both general-purpose voice recognition and voice recognition using the dictionary 80 on the voice 90 spoken by the user 9. Therefore, the information processing system S1 of the present embodiment can provide the user 9 with a voice assistant service based on general-purpose voice recognition while controlling the television device 10 based on the result of highly accurate voice recognition of information about programs. In other words, the information processing system S1 can provide the voice assistant service to the user 9 and also make the result of highly accurate voice recognition of information about programs available to that service.

例えば、汎用的な音声アシスタントサービスとは別に、テレビジョン装置を制御するために、番組に関する情報専用の音声アシスタントサービスを別途利用する場合、ユーザは２つの音声アシスタントサービスを用途に応じて使い分けることとなり、操作が煩わしくなる可能性がある。これに対して、本実施形態によれば、テレビジョン装置１０は、入力された音声９０を、番組情報認識サーバ２０と、音声アシスタントサーバ５０と、に音声信号として送信するため、ユーザ９は音声アシスタントサービスを意識的に使い分けなくとも、番組に関する情報についての高精度な音声認識を利用することができる。 For example, if a voice assistant service dedicated to information about programs is used to control a television device separately from a general-purpose voice assistant service, the user has to choose between the two voice assistant services depending on the purpose, which can make operation cumbersome. In contrast, according to the present embodiment, the television device 10 transmits the input voice 90 as a voice signal to both the program information recognition server 20 and the voice assistant server 50, so the user 9 can benefit from highly accurate voice recognition of information about programs without consciously choosing between voice assistant services.

なお、本実施形態においては、テレビジョン装置１０を映像装置の一例としたが、映像装置は、ＢＤレコーダやＤＶＤレコーダ等であっても良いし、テレビジョン装置１０に接続された音声入力装置であっても良い。 In the present embodiment, the television device 10 is used as an example of the video device, but the video device may be a BD recorder, a DVD recorder, or the like, or may be a voice input device connected to the television device 10.

また、本実施形態においては意図判断サーバ３０からの指示信号を受けてテレビジョン装置１０が録画または再生を実行するものとしたが、指示信号に基づいて実行される処理はこれに限定されるものではない。 Further, in the present embodiment, the television device 10 executes recording or playback in response to the instruction signal from the intention determination server 30, but the processing executed based on the instruction signal is not limited to these.

また、マイク１１７を本実施形態における音声入力部の一例としても良い。また、音声入力部１１とマイク１１７とオーディオＩ／Ｆ１１８とを、音声入力部としても良い。 Further, the microphone 117 may be used as an example of the voice input unit in the present embodiment. Further, the voice input unit 11, the microphone 117, and the audio I / F 118 may be used as the voice input unit.

なお、本実施形態においては、音声アシスタントサーバ５０が構文解析を行うものとしたが、意図判断サーバ３０が構文解析を行うものとしても良い。この場合は、音声アシスタントサーバ５０は、意図判断サーバ３０に対して、テレビジョン装置ＩＤと、音声ＩＤと、第１の音声認識データ９１とを対応付けて送信する。 In the present embodiment, the voice assistant server 50 performs the parsing, but the intention determination server 30 may perform the parsing. In this case, the voice assistant server 50 transmits the television device ID, the voice ID, and the first voice recognition data 91 to the intention determination server 30 in association with each other.

（変形例１） (Modification 1)
上述の実施形態では、番組情報は番組タイトルであるものとして説明したが、番組情報はこれに限定されるものではなく、番組のジャンルや番組の出演者名でも良い。また、番組情報認識サーバ２０は、番組のジャンルや番組の出演者名が登録された辞書を記憶するものとしても良い。 In the above-described embodiment, the program information is described as being the program title, but the program information is not limited to this, and may be the genre of the program or the name of a performer in the program. Further, the program information recognition server 20 may store a dictionary in which program genres and performer names are registered.

図１１は、本変形例にかかる辞書１０８０の一例を示す図である。図１１に示すように、番組情報認識サーバ２０の記憶部２５に登録される辞書１０８０は、番組のタイトルの文字データと、番組のタイトルの発音を示す情報（読み仮名）とに加えて、さらに、番組のジャンル、当該番組の出演者名の文字データ、当該番組の出演者名の発音、当該番組の出演者を含むグループ名、当該番組の出演者を含むグループ名の発音等を対応付けて記憶するものとしても良い。また、これらの情報は複数のデータベースに分散されて登録されても良い。 FIG. 11 is a diagram showing an example of the dictionary 1080 according to this modification. As shown in FIG. 11, the dictionary 1080 registered in the storage unit 25 of the program information recognition server 20 may store, in addition to the character data of the program title and information indicating the pronunciation of the title (its reading in kana), the genre of the program, the character data of the names of the performers in the program, the pronunciation of those names, the names of groups that include the performers, the pronunciation of those group names, and so on, all in association with one another. These pieces of information may also be distributed across a plurality of databases.

例えば、番組情報認識サーバ２０の特定部２２は、ユーザ９が発話した音声９０が番組の出演者名を含む場合に、当該出演者が出演する番組を、第２の番組情報として特定しても良い。また、ユーザ９が発話した音声９０がグループ名を含む場合に、当該グループに所属するメンバーが出演者として登録されている番組を、第２の番組情報として特定しても良い。また、芸能人によっては、芸名が途中で変更される場合や、複数の愛称で呼ばれる場合がある。辞書１０８０には、出演者の発音として、新旧の複数の芸名や、愛称等の発音を登録されても良い。 For example, when the voice 90 spoken by the user 9 includes the name of a performer, the specifying unit 22 of the program information recognition server 20 may identify the programs in which that performer appears as the second program information. Likewise, when the voice 90 includes a group name, the specifying unit 22 may identify the programs in which members of that group are registered as performers as the second program information. Some entertainers change their stage names during their careers or are known by several nicknames. The dictionary 1080 may therefore register multiple pronunciations for a performer, such as old and new stage names and nicknames.
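
A lookup over an extended dictionary of this kind can be sketched as follows. The layout of `DICTIONARY_1080`, the alias tables, and the function `programs_for_reading` are hypothetical, and every performer, group, and second program name in the sample data is made up for illustration.

```python
from typing import List

# Hypothetical entries modeled on dictionary 1080; all names below are invented.
DICTIONARY_1080 = [
    {"title": "楽しい☆トーーーク！！", "genre": "バラエティ",
     "performers": ["山田太郎"], "groups": ["サンプルズ"]},
    {"title": "サンプルニュース", "genre": "ニュース",
     "performers": ["鈴木花子"], "groups": []},
]

# Several readings (stage names, nicknames) may map to one performer.
PERFORMER_READINGS = {
    "やまだたろう": "山田太郎",
    "やまちゃん": "山田太郎",
    "すずきはなこ": "鈴木花子",
}
GROUP_READINGS = {"さんぷるず": "サンプルズ"}


def programs_for_reading(reading: str) -> List[str]:
    """Return titles of programs whose performers or groups match the reading."""
    performer = PERFORMER_READINGS.get(reading)
    group = GROUP_READINGS.get(reading)
    return [
        entry["title"]
        for entry in DICTIONARY_1080
        if performer in entry["performers"] or group in entry["groups"]
    ]


by_nickname = programs_for_reading("やまちゃん")
by_group = programs_for_reading("さんぷるず")
```

Keeping the aliases in a separate table from the program entries mirrors the remark that the information may be distributed across several databases.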

このように辞書１０８０に番組に関する種々の情報を記憶することにより、番組情報認識サーバ２０は、ユーザ９が発話した音声９０に含まれる番組に関する情報を、より高精度に特定することができる。 By storing various information about the program in the dictionary 1080 in this way, the program information recognition server 20 can specify the information about the program included in the voice 90 spoken by the user 9 with higher accuracy.

（変形例２） (Modification 2)
上述の実施形態では、番組情報認識サーバ２０による音声認識結果は、テレビジョン装置１０に対して送信された後、テレビジョン装置１０によって記憶サーバ４０に送信されていたが、音声認識結果の送信経路はこれに限定されるものではない。 In the above-described embodiment, the voice recognition result produced by the program information recognition server 20 is transmitted to the television device 10 and then forwarded by the television device 10 to the storage server 40, but the transmission path of the voice recognition result is not limited to this.

図１２は、本変形例にかかる情報処理システムＳ２の全体構成の一例を示す図である。図１２に示すように、情報処理システムＳ２は、テレビジョン装置１０１０と、番組情報認識サーバ１０２０と、意図判断サーバ３０と、記憶サーバ１０４０とを備える。また、テレビジョン装置１０１０および意図判断サーバ３０は、音声アシスタントサーバ５０とネットワークを介して接続している。 FIG. 12 is a diagram showing an example of the overall configuration of the information processing system S2 according to this modification. As shown in FIG. 12, the information processing system S2 includes a television device 1010, a program information recognition server 1020, an intention determination server 30, and a storage server 1040. Further, the television device 1010 and the intention determination server 30 are connected to the voice assistant server 50 via a network.

The program information recognition server 1020 of this modification has the functions described in the above embodiment and, in addition, transmits the voice recognition result to the storage server 1040. More specifically, the output unit 23 of the program information recognition server 1020 associates the television device ID, the voice ID, the second voice recognition data 92, and the program information identification result with one another and outputs them to the storage server 1040.

The storage server 1040 of this modification has the functions described in the above embodiment and, in addition, stores the television device ID, the voice ID, the second voice recognition data 92, and the program information identification result transmitted from the program information recognition server 1020.

The intention determination server 30, the television device 1010, and the voice assistant server 50 of this modification have the functions described in the above embodiment.

Because the voice recognition result is transmitted directly from the program information recognition server 1020 to the storage server 1040 as in this modification, the result can be saved in the storage server 1040 without the television device 1010 having to relay information between the program information recognition server 1020 and the storage server 1040.
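The direct path of this modification can be sketched as follows. This is a minimal in-memory illustration, not the actual server implementation: the `RecognitionRecord` and `StorageServer` classes, the `output_result` function, and the choice of the (television device ID, voice ID) pair as the lookup key are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class RecognitionRecord:
    """The four items the output unit 23 associates with one another."""
    television_device_id: str
    voice_id: str
    second_recognition_text: str    # second voice recognition data 92
    program_info: Tuple[str, ...]   # program information identification result


class StorageServer:
    """Hypothetical stand-in for the storage server 1040."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], RecognitionRecord] = {}

    def save(self, record: RecognitionRecord) -> None:
        # Assumption: records are keyed by television device ID and voice ID.
        key = (record.television_device_id, record.voice_id)
        self._records[key] = record

    def search(self, television_device_id: str,
               voice_id: str) -> Optional[RecognitionRecord]:
        return self._records.get((television_device_id, voice_id))


def output_result(storage: StorageServer, television_device_id: str,
                  voice_id: str, text: str,
                  program_info: Tuple[str, ...]) -> None:
    """Send the result straight to storage, without passing through the television device."""
    storage.save(RecognitionRecord(television_device_id, voice_id, text, program_info))


storage = StorageServer()
output_result(storage, "tv-001", "voice-42", "record Morning Show", ("Morning Show",))
print(storage.search("tv-001", "voice-42"))
```

Keying the record by both IDs lets another server later retrieve the result for a specific utterance from a specific television device.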

(Modification 3)
In the embodiment described above, the program information recognition server 20, the intention determination server 30, the storage server 40, and the voice assistant server 50 are described as being built as separate servers, but the functions of a plurality of these servers may be realized by a single server. For example, the voice assistant server 50 and the intention determination server 30 may be integrated into one server. The program information recognition server 20 and the storage server 40 may be integrated into one server. The intention determination server 30 and the storage server 40 may be integrated into one server. Conversely, the functions of one server may be realized by a plurality of computers using a technique such as virtualization.

As described above, according to the embodiment, for information about programs, the voice recognition result obtained using the dedicated dictionaries 80 and 1080 is adopted in place of the corresponding part of the result from general-purpose voice recognition. This improves the accuracy of voice recognition for program-related information.
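The replacement summarized above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `replace_program_info`, the representation of the syntax analysis result as a dictionary with a `program_title` key, and the sample strings are all hypothetical and are not taken from the embodiment.

```python
from typing import Dict, Optional


def replace_program_info(first_text: str,
                         parse_result: Dict[str, str],
                         second_program_info: Optional[str]) -> str:
    """Swap the program information in the general-purpose result for the dictionary-based one.

    first_text: first voice recognition data (general-purpose recognizer).
    parse_result: assumed syntax analysis result; "program_title" holds the
        span judged to be program information, if any.
    second_program_info: program information identified with the dedicated
        dictionary, or None when nothing was identified.
    """
    first_program_info = parse_result.get("program_title")

    # Determination step: no program information in the first result.
    if not first_program_info or first_program_info not in first_text:
        return first_text

    # Nothing to substitute, so keep the general-purpose result as is.
    if second_program_info is None:
        return first_text

    # Replacement step: only the program information changes.
    return first_text.replace(first_program_info, second_program_info, 1)


third_text = replace_program_info(
    "record morning shaw tomorrow",
    {"intent": "record", "program_title": "morning shaw"},
    "Morning Show",
)
print(third_text)
```

Only the program-related span is replaced; the rest of the general-purpose recognition result, such as the command and the date expression, is left untouched.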

The programs executed by the television devices 10 and 1010, the program information recognition servers 20 and 1020, the intention determination server 30, the storage servers 40 and 1040, and the voice assistant server 50 of the embodiment are provided as files in an installable or executable format, recorded on a computer-readable recording medium such as a CD-ROM, a flexible disk (FD), a CD-R, or a DVD (Digital Versatile Disk).

The programs executed by the television devices 10 and 1010, the program information recognition servers 20 and 1020, the intention determination server 30, the storage servers 40 and 1040, and the voice assistant server 50 of the embodiment may also be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. These programs may also be provided or distributed via a network such as the Internet. Alternatively, these programs may be provided by being incorporated in advance in a ROM or the like.

The program executed by the television devices 10 and 1010 of the embodiment has a module configuration including the units described above (the voice input unit, wake word determination unit, first transmission unit, first reception unit, second transmission unit, second reception unit, playback unit, and recording unit). In terms of actual hardware, a CPU (processor) reads the program from the storage medium and executes it, whereby each of these units is loaded onto the main storage device, and the voice input unit, wake word determination unit, first transmission unit, first reception unit, second transmission unit, second reception unit, playback unit, and recording unit are generated on the main storage device.

The program executed by the program information recognition servers 20 and 1020 of the embodiment has a module configuration including the units described above (the reception unit, specifying unit, and output unit). In terms of actual hardware, a CPU reads the program from the storage medium and executes it, whereby each of these units is loaded onto the main storage device, and the reception unit, specifying unit, and output unit are generated on the main storage device.

The program executed by the intention determination server 30 of the embodiment has a module configuration including the units described above (the acquisition unit, determination unit, replacement unit, and video device control unit). In terms of actual hardware, a CPU reads the program from the storage medium and executes it, whereby each of these units is loaded onto the main storage device, and the acquisition unit, determination unit, replacement unit, and video device control unit are generated on the main storage device.

The program executed by the storage servers 40 and 1040 of the embodiment has a module configuration including the units described above (the storage processing unit and search unit). In terms of actual hardware, a CPU reads the program from the storage medium and executes it, whereby each of these units is loaded onto the main storage device, and the storage processing unit and search unit are generated on the main storage device.

Although several embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and are also included in the invention described in the claims and its equivalents.

10, 1010 Television device
11 Voice input unit
12 Wake word determination unit
13 First transmission unit
14 First reception unit
15 Second transmission unit
16 Second reception unit
17 Playback unit
18 Recording unit
20, 1020 Program information recognition server
21 Reception unit
22 Specifying unit
23 Output unit
25 Storage unit
30 Intention determination server
31 Acquisition unit
32 Determination unit
33 Replacement unit
34 Video device control unit
35 Storage unit
40, 1040 Storage server
41 Storage processing unit
42 Search unit
45 Storage unit
50 Voice assistant server
80, 1080 Dictionary
81 Voice information
90 Voice
91 First voice recognition data
92 Second voice recognition data
117 Microphone
S1, S2 Information processing system

Claims

An information processing device comprising:
an acquisition unit that acquires first voice recognition data, obtained by a first voice recognition device performing voice recognition on spoken voice, and a syntax analysis result of the first voice recognition data;
a determination unit that determines, based on the syntax analysis result, whether or not the first voice recognition data includes first program information related to a program; and
a replacement unit that, when the determination unit determines that the first voice recognition data includes the first program information, acquires second program information included in second voice recognition data, obtained by a second voice recognition device having a dictionary in which information about programs is registered performing voice recognition on the voice, and replaces the first program information included in the first voice recognition data with the second program information.

The information processing device according to claim 1, further comprising a video device control unit that controls the operation of a video device based on third voice recognition data in which the first program information included in the first voice recognition data has been replaced with the second program information by the replacement unit.

An information processing system including a video device, a voice recognition device, a storage device, and an information processing device, wherein
the video device comprises:
a voice input unit for inputting spoken voice; and
a transmission unit that transmits the voice as a voice signal to the voice recognition device and to another voice recognition device different from the voice recognition device,
the voice recognition device comprises:
a storage unit that stores a dictionary in which information about programs is registered;
a specifying unit that specifies program information related to a program included in the voice signal based on the information registered in the dictionary; and
an output unit that outputs the specified program information,
the storage device stores the program information specified by the voice recognition device, and
the information processing device comprises:
an acquisition unit that acquires, from the other voice recognition device, first voice recognition data obtained by voice recognition of the voice signal and a syntax analysis result of the first voice recognition data;
a determination unit that determines, based on the syntax analysis result, whether or not the first voice recognition data includes first program information related to a program;
a replacement unit that, when the determination unit determines that the first voice recognition data includes the first program information, acquires second program information included in second voice recognition data obtained by the voice recognition device performing voice recognition on the voice signal, and replaces the first program information included in the first voice recognition data with the second program information; and
a video device control unit that controls the operation of the video device based on third voice recognition data in which the first program information included in the first voice recognition data has been replaced with the second program information by the replacement unit.

A video device comprising:
a voice input unit for inputting spoken voice;
a first transmission unit that transmits the voice as a voice signal to a first voice recognition device and a second voice recognition device;
a first reception unit that receives, from the second voice recognition device, program information related to a program included in the voice signal;
a second transmission unit that transmits the received program information to an external device; and
a second reception unit that receives an instruction signal related to the program information specified by the second voice recognition device.