JP7572388B2 - Data processing device, data processing method and program

Info

Publication number: JP7572388B2
Application number: JP2022024675A
Authority: JP
Inventors: 亜菲劉; 忠行福原
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2022-02-21
Filing date: 2022-02-21
Publication date: 2024-10-23
Anticipated expiration: 2042-02-21
Also published as: JP2023121372A

Description

The present invention relates to a data processing device, a data processing method, and a program.

A known technology synthesizes text data with phoneme data to read text aloud in a desired voice (see, for example, Patent Document 1).

Patent Document 1: JP 2002-328694 A

However, such speech synthesis technology reads text aloud by combining prepared phonemes, and it has not adequately met the needs of users such as busy parents who want to deepen family bonds by letting their family hear a recitation in the parent's own voice. In addition, a conventional speech synthesis device outputs speech that sounds like text data being read mechanically, so the listener finds it unnatural.

The present invention has been made in view of these points, and its object is to enable text to be read aloud in a desired voice while reproducing the pauses and intonation of another reader's recitation.

The data processing device of the first aspect of the present invention includes: a recitation information storage unit that stores recitation information, which is time-series data indicating features of the voice of a first speaker reciting content composed of character strings, and which associates a time in the playback time of the content with the character uttered at that time, a volume, and a pitch; an utterance information storage unit that stores utterance information, which is generated by sampling a voice uttered by a second speaker, and which associates a plurality of characters with the timbre of the voice uttered when pronouncing each of the plurality of characters; a generation unit that generates reading data for outputting, for the character at each of the times in the recitation information, a sound composed of the timbre corresponding to that character in the utterance information and the volume and the pitch at that time; and an output control unit that performs control to output the reading data.

The recitation information storage unit may store the recitation information further associated with first reference pitch data indicating a reference pitch in the content, the utterance information storage unit may further store second reference pitch data indicating a reference pitch of the second speaker, and the generation unit may generate the reading data such that the timbre corresponding in the utterance information to the character at each of the times in the recitation information is output at a pitch determined based on the pitch at that time and the ratio between the first reference pitch data and the second reference pitch data.

The data processing device may further include a dialect information storage unit that stores dialect information associating a plurality of words, a dialect form corresponding to each of the plurality of words, one or more characters constituting the dialect form, and a pitch for pronouncing the characters constituting the dialect form. The recitation information storage unit may store the recitation information further associated with the words constituting the recitation information and the one or more characters constituting each word. The generation unit may further generate replaced recitation information in which the one or more characters included in the recitation information, and the pitches corresponding to those characters, are replaced with the characters included in the dialect form that corresponds, in the dialect information, to the word those characters constitute, and with the pitches for pronouncing the characters included in the dialect form. The generation unit may then generate the reading data for outputting, for the replaced character at each of the times in the replaced recitation information, a sound composed of the timbre corresponding to that character in the utterance information, the volume at that time, and the replaced pitch.

The recitation information storage unit may store the recitation information in which each time is associated with a flag indicating a timing for inserting a phrase, and the generation unit may generate the reading data such that a phrase selected from a plurality of predetermined phrases is output at the timing indicated by the flag.

The utterance information storage unit may further store image data corresponding to the second speaker, and the output control unit may perform control to display an image corresponding to the second speaker on a display unit while it is performing control to output the reading data.

The data processing device may further include an imaging data acquisition unit that acquires imaging data capturing a user, and a determination unit that determines the state of the user by performing image analysis on the imaging data acquired by the imaging data acquisition unit, and the output control unit may stop the output of the reading data or change the output mode of the reading data when the determination unit determines that the user is asleep.

The data processing device may further include an imaging data acquisition unit that acquires imaging data capturing a user, and a determination unit that determines the state of the user by performing image analysis on the imaging data acquired by the imaging data acquisition unit and notifies an information terminal of the determined state of the user.

The recitation information storage unit may store each of a plurality of pieces of recitation information, obtained by a plurality of different first speakers each reciting the content, in association with the situation for which that piece of recitation information is suitable. The determination unit may determine the situation of the user based on at least one of an attribute of the user and the state of the user. The generation unit may select the recitation information associated with the situation of the user determined by the determination unit, and generate the reading data for outputting, for the character at each of the times in the selected recitation information, a sound composed of the timbre corresponding to that character in the utterance information and the volume and the pitch at that time.

The recitation information storage unit may further store, in association with the recitation information, dictionary information that associates words included in the content to be recited in the recitation information with the meanings of those words. The data processing device may include a voice information acquisition unit that acquires voice information uttered by a user listening to the content, and a voice recognition unit that performs voice recognition on the voice information acquired by the voice information acquisition unit to obtain the content of the user's utterance. When the content of the user's utterance obtained by the voice recognition unit is a question about the content, the generation unit may refer to the dictionary information and generate answer information indicating an answer to the question, and the output control unit may perform control to output the answer information.

The data processing device may further include a reaction information storage unit, and the voice recognition unit may store, in the reaction information storage unit, reaction information that associates the content of the user's utterance with the time corresponding to the timing at which the user made the utterance.

The data processing method of the second aspect of the present invention is executed by a computer and may include: a step of acquiring recitation information stored in a recitation information storage unit, the recitation information being time-series data indicating features of the voice of a first speaker reciting content composed of character strings, and associating a time in the playback time of the content with the character uttered at that time, a volume, and a pitch; a step of acquiring utterance information stored in an utterance information storage unit, the utterance information being generated by sampling a voice uttered by a second speaker, and associating a plurality of characters with the timbre of the voice uttered when pronouncing each of the plurality of characters; a step of generating reading data for outputting, for the character at each of the times in the recitation information, a sound composed of the timbre corresponding to that character in the utterance information and the volume and the pitch at that time; and a step of performing control to output the reading data.

The program of the third aspect of the present invention causes a computer to execute: a step of acquiring recitation information stored in a recitation information storage unit, the recitation information being time-series data indicating features of the voice of a first speaker reciting content composed of character strings, and associating a time in the playback time of the content with the character uttered at that time, a volume, and a pitch; a step of acquiring utterance information stored in an utterance information storage unit, the utterance information being generated by sampling a voice uttered by a second speaker, and associating a plurality of characters with the timbre of the voice uttered when pronouncing each of the plurality of characters; a step of generating reading data for outputting, for the character at each of the times in the recitation information, a sound composed of the timbre corresponding to that character in the utterance information and the volume and the pitch at that time; and a step of performing control to output the reading data.

According to the present invention, text can be read aloud in a desired voice while reproducing the pauses and intonation of another reader's recitation.

FIG. 1 is a diagram illustrating an overview of a data processing system S according to an embodiment.
FIG. 2 is a block diagram showing the configuration of the data processing device 1.
FIG. 3 is a diagram showing an example of the data structure of the recitation information stored in the recitation information storage unit 121.
FIG. 4 is a diagram showing an example of the data structure of the metadata associated with the recitation information.
FIG. 5 is a diagram showing an example of the data structure of the utterance information stored in the utterance information storage unit 122.
FIG. 6 is a diagram showing an example of the data structure of the dialect information stored in the dialect information storage unit 123.
FIG. 7 is a diagram showing an example of the data structure of the dictionary information stored in the recitation information storage unit 121.
FIG. 8 is a flowchart showing the flow of processing in the data processing device 1.

[Overview of data processing device 1]
FIG. 1 is a diagram illustrating an overview of a data processing system S according to an embodiment. The data processing system S is a system that lets a user listen to a recitation of content. The data processing system S includes a data processing device 1 and an information terminal 2.

The data processing device 1 is a device that generates reading data for outputting speech that reproduces a first speaker's distinctive recitation of content in the voice quality of a second speaker who is different from the speaker who recited it. The data processing device 1 is, for example, a server or a personal computer.

In a preferred use case, the first speaker is a professional reader, narrator, actor, voice actor, or the like. The second speaker is, for example, a parent, sibling, grandparent, or friend of the user listening to the content, or the creator of the content, such as its author. Because the data processing device 1 generates reading data using such a voice, the user can enjoy a distinctive recitation performed by a professional narrator or the like in the voice of someone familiar to the user. Reading aloud with the data processing device 1 is effective, for example, when one wants a baby to learn its parent's voice, or wants to reassure the baby by letting it hear its parent's voice.

The information terminal 2 is a device for outputting audio corresponding to the reading data input from the data processing device 1. The information terminal 2 is, for example, a personal computer, a smart speaker, a smartphone, or a tablet. The data processing device 1 and the information terminal 2 may be configured as a single unit.

The data processing device 1 stores recitation information indicating the characteristics of the voice in which the first speaker S1 recited the content. The content is, for example, a book such as a picture book, novel, manga, or educational book, a script for a play or the like, or a kamishibai (picture-card story). The recitation information is time-series data indicating features of the voice in which the first speaker S1 recited the content. The recitation information associates a time in the playback time of the content with the character uttered at that time, a volume, and a pitch. As an example, the data processing device 1 acquires audio data of the first speaker S1 reciting the content, and generates the recitation information by analyzing the volume, the pitch, and the uttered characters in the acquired audio data. The recitation information can also be described as data indicating, for each character constituting the content, how long after the start of playback it is uttered and at what volume and pitch.

The data processing device 1 may store a plurality of pieces of recitation information generated by different first speakers reciting the same content. Storing recitation information from different first speakers for the same content makes it possible to select, from among individually distinctive recitations, the one that suits the situation. The data processing device 1 may also store a plurality of pieces of recitation information generated by the same speaker reciting different content.

The data processing device 1 stores utterance information indicating the voice quality of a second speaker S2, who is different from the first speaker S1. The utterance information associates a plurality of characters with the timbre of the voice uttered by the second speaker S2 when pronouncing each of the plurality of characters. As an example, the data processing device 1 acquires audio data obtained by sampling the voice uttered by the second speaker S2, and generates the utterance information by analyzing the pitch, the timbre, and the uttered characters in the acquired audio data.

The user U operates the information terminal 2 to select the content to listen to and the second speaker. The information terminal 2 transmits operation information including the selected content and second speaker to the data processing device 1. The data processing device 1 acquires the recitation information corresponding to the content and the utterance information corresponding to the second speaker that are included in the operation information.

The data processing device 1 generates reading data by combining the acquired recitation information and utterance information. Specifically, the data processing device 1 generates reading data that represents, in time series, a sound composed of the timbre corresponding in the utterance information to the character at each time included in the recitation information, and the volume and pitch that the recitation information indicates for that time.

The data processing device 1 then outputs the reading data to the information terminal 2. With the data processing device 1 configured in this way, text can be read aloud in a desired voice while reproducing the pauses and intonation of another reader's recitation.

[Configuration of data processing device 1]
FIG. 2 is a block diagram showing the configuration of the data processing device 1. The data processing device 1 has a communication unit 11, a storage unit 12, and a control unit 13. The storage unit 12 has a recitation information storage unit 121, an utterance information storage unit 122, a dialect information storage unit 123, and a reaction information storage unit 124. The control unit 13 has an operation reception unit 131, a generation unit 132, an output control unit 133, an imaging data acquisition unit 134, a determination unit 135, a voice information acquisition unit 136, and a voice recognition unit 137.

The communication unit 11 is a communication interface for transmitting and receiving data to and from other devices. The storage unit 12 is a storage medium such as a ROM (Read Only Memory), a RAM (Random Access Memory), an SSD (Solid State Drive), or an HDD (Hard Disk Drive). The storage unit 12 stores various programs executed by the control unit 13.

The recitation information storage unit 121 stores recitation information, which is time-series data indicating features of the voice of a first speaker reciting content composed of character strings, and which associates a time in the playback time of the content with the character uttered at that time, a volume, and a pitch. FIG. 3 is a diagram showing an example of the data structure of the recitation information stored in the recitation information storage unit 121. In the recitation information shown in FIG. 3, "time", "character", "volume", and "pitch" are associated with one another.

The "time" in the recitation information indicates where the data item is located within the playback time of the content. As an example, the "time" is expressed in milliseconds or microseconds. The "time" may instead be expressed as an index indicating the timing. In that case, the time can be obtained by multiplying the index by the sampling period.
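The index-based representation of "time" can be sketched as follows. This is a minimal illustration only: the sampling period of 10 ms and the function name are assumptions, since the source does not give a concrete value.

```python
# Hypothetical sampling period; the source does not specify a value.
SAMPLING_PERIOD_MS = 10


def index_to_time_ms(index: int, period_ms: int = SAMPLING_PERIOD_MS) -> int:
    """Convert a timing index into a playback time in milliseconds."""
    return index * period_ms


print(index_to_time_ms(6))  # 60
```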

The "character" in the recitation information indicates the character uttered at that time. In FIG. 3, as an example, the "character" is shown as the kana character corresponding to the uttered sound. When kana characters are used, a predetermined flag ("!" in FIG. 3) is inserted in the "character" at the timing when a consonant is pronounced. This distinguishes whether the uttered sound includes a consonant or is a vowel. Instead of a kana character, the "character" may be expressed by the uttered vowel and consonant. A "-" in the "character" indicates that no character is uttered at that time. In other words, it indicates that the first speaker paused in the recitation at that time.

The "volume" in the recitation information indicates the loudness of the sound uttered at the associated time. As an example, the "volume" may be expressed as a ratio to a reference volume, or as a sound pressure level (dB). The "pitch" indicates the pitch of the character uttered at that time. The "pitch" may be expressed as a pulse period or as a frequency.

The recitation information storage unit 121 stores recitation information further associated with the words constituting the recitation information and the one or more characters constituting each word. In the recitation information shown in FIG. 3, a "word" is further associated. The "word" indicates which word the "character" uttered at that time belongs to and which position in that word it occupies. As an example, for む ("mu") at "time" 1 to 5, information is stored indicating that it is the first character of the word むかし ("mukashi") corresponding to word ID M01, and for か ("ka") at times 6 to 11, information is stored indicating that it is the second character of the word むかし. The word ID is information that identifies a word.

The recitation information storage unit 121 stores recitation information in which each time is associated with a flag indicating a timing for inserting a phrase. In the recitation information shown in FIG. 3, "phrase insertion" is further associated. "Phrase insertion" is a flag indicating, for each time, whether it is a timing for inserting a predetermined phrase; a value of "1" indicates that it is a timing for inserting a phrase. The recitation information may also be associated with a flag indicating the type of phrase to be inserted according to the scene of the content.
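The fields described for FIG. 3 can be sketched as one record per time step. The class name, field names, and all numeric values below are illustrative assumptions, not the layout used in the patent; a pause ("-") is modeled here as a character of None.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RecitationFrame:
    time: int                           # time or timing index
    character: Optional[str]            # None stands for "-" (no character uttered)
    volume: float                       # e.g. ratio to a reference volume
    pitch: float                        # e.g. frequency in Hz
    word_id: Optional[str] = None       # word the character belongs to
    char_position: Optional[int] = None # 1-based position within the word
    insert_phrase: bool = False         # True corresponds to the flag value "1"


def pause_spans(frames: List[RecitationFrame]) -> List[Tuple[int, int]]:
    """Return (start, end) times of consecutive frames in which nothing is uttered."""
    spans = []
    start = None
    last = None
    for frame in frames:
        if frame.character is None:
            if start is None:
                start = frame.time
            last = frame.time
        elif start is not None:
            spans.append((start, last))
            start = None
    if start is not None:
        spans.append((start, last))
    return spans


# Illustrative data only.
frames = [
    RecitationFrame(1, "む", 0.8, 200.0, "M01", 1),
    RecitationFrame(2, "む", 0.8, 205.0, "M01", 1),
    RecitationFrame(3, None, 0.0, 0.0),
    RecitationFrame(4, None, 0.0, 0.0, insert_phrase=True),
    RecitationFrame(5, "か", 0.9, 210.0, "M01", 2),
]
print(pause_spans(frames))  # [(3, 4)]
```

Keeping pauses as explicit frames is what lets the first speaker's timing carry over to the generated reading.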

The recitation information storage unit 121 may store each of a plurality of pieces of recitation information, obtained by a plurality of different first speakers each reciting the content, in association with the situation for which that piece of recitation information is suitable.
The recitation information storage unit 121 stores metadata of the recitation information in association with the recitation information. FIG. 4 is a diagram showing an example of the data structure of the metadata associated with the recitation information. The metadata associates a "recitation information ID" that identifies the recitation information, a "content" that identifies the recited content, a "speaker ID" that identifies the second speaker who recited the content, and a "feature" that indicates the situation for which the recitation information is suitable. The "feature" contains information indicating the situation that each recitation suits, such as "cheerful", "sleep-inducing", "calming", or "exciting".
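As a rough sketch of how metadata like that of FIG. 4 could be used to pick a recitation suited to a situation, the lookup below matches on content and feature. The record layout, the IDs, and the function name are hypothetical; the source only names the kinds of fields that are associated.

```python
from typing import Dict, List, Optional

# Illustrative metadata records; every value here is made up.
METADATA: List[Dict[str, object]] = [
    {"recitation_id": "R001", "content": "C01", "speaker_id": "SP01",
     "feature": "cheerful", "reference_pitch": 220.0},
    {"recitation_id": "R002", "content": "C01", "speaker_id": "SP02",
     "feature": "calming", "reference_pitch": 180.0},
]


def select_recitation(metadata: List[Dict[str, object]],
                      content: str, situation: str) -> Optional[str]:
    """Return the ID of the first recitation of `content` whose feature matches."""
    for entry in metadata:
        if entry["content"] == content and entry["feature"] == situation:
            return str(entry["recitation_id"])
    return None


print(select_recitation(METADATA, "C01", "calming"))  # R002
```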

The utterance information storage unit 122 stores utterance information, which is generated by sampling the voice uttered by the second speaker, and which associates a plurality of characters with the timbre of the voice uttered when pronouncing each of the plurality of characters. FIG. 5 is a diagram showing an example of the data structure of the utterance information stored in the utterance information storage unit 122. The utterance information associates information identifying the speaker, a character, and the sound waveform corresponding to the timbre of that character. As an example, the sound waveforms in FIG. 5 are stored as information recording, in time series, the waveform corresponding to each character.

The dialect information storage unit 123 stores dialect information associating a plurality of words, a dialect form corresponding to each of the plurality of words, one or more characters constituting the dialect form, and a pitch for pronouncing the characters constituting the dialect form. FIG. 6 is a diagram showing an example of the data structure of the dialect information stored in the dialect information storage unit 123. The dialect information associates a "dialect type", a "word ID", a "word", a "dialect", a "character", and a "pitch". The "dialect type" indicates, for example, the Kansai dialect or the Okinawa dialect. The "dialect" indicates the dialect form corresponding to the "word" in the dialect indicated by the "dialect type". As an example, it is stored that むかし ("mukashi") corresponds to んかし ("nkashi"). The "character" indicates the characters constituting the "dialect". It is stored that んかし is composed of ん ("n"), か ("ka"), and し ("shi"). The "pitch" indicates the pitch to be given to each character when uttering the dialect form.
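The dialect table and the character and pitch replacement it supports can be sketched as below. The dialect type key, the pitch values, and the dict layout are assumptions for illustration. The sketch also assumes that the dialect form has the same number of characters as the original word; the source does not say how differing lengths are handled.

```python
from typing import Dict, List, Tuple

# Table keyed by (dialect type, word ID). Pitch values are made up.
DIALECT_INFO: Dict[Tuple[str, str], Dict[str, object]] = {
    ("dialect_a", "M01"): {
        "word": "むかし",
        "dialect": "んかし",
        "characters": ["ん", "か", "し"],
        "pitches": [180.0, 200.0, 170.0],
    },
}


def replace_with_dialect(frames: List[dict], dialect_type: str) -> List[dict]:
    """Return copies of the frames with character and pitch swapped for the dialect form.

    Each frame is a dict with the keys character, pitch, word_id and
    position (1-based index of the character within its word).
    """
    replaced = []
    for frame in frames:
        new_frame = dict(frame)
        entry = DIALECT_INFO.get((dialect_type, frame.get("word_id")))
        position = frame.get("position")
        if entry is not None and position is not None:
            i = position - 1
            if 0 <= i < len(entry["characters"]):
                new_frame["character"] = entry["characters"][i]
                new_frame["pitch"] = entry["pitches"][i]
        replaced.append(new_frame)
    return replaced


frames = [
    {"time": 1, "character": "む", "volume": 0.8, "pitch": 150.0,
     "word_id": "M01", "position": 1},
    {"time": 6, "character": "か", "volume": 0.9, "pitch": 160.0,
     "word_id": "M01", "position": 2},
]
print([f["character"] for f in replace_with_dialect(frames, "dialect_a")])  # ['ん', 'か']
```

Only the character and pitch change; the time and volume of each frame are carried over, so the original pacing is kept.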

The control unit 13 is, for example, a CPU (Central Processing Unit). By executing a control program stored in the storage unit 12, the control unit 13 functions as the operation reception unit 131, the generation unit 132, the output control unit 133, the imaging data acquisition unit 134, the determination unit 135, the voice information acquisition unit 136, and the voice recognition unit 137.

[Speech synthesis processing]
The operation reception unit 131 receives, from the information terminal 2, operation information corresponding to a user's operation. As an example, the operation reception unit 131 receives a playback instruction including the content selected by the user.

The generation unit 132 generates reading data for outputting, for the character at each time in the recitation information, a sound composed of the timbre corresponding to that character in the utterance information and the volume and pitch at that time. The generation unit 132 acquires the recitation information corresponding to the content selected by the user, and the utterance information. First, for the "character" at each time in the acquired recitation information, the generation unit 132 acquires the sound waveform of the corresponding character in the utterance information. That is, if the character in the recitation information at a certain time is む ("mu"), the generation unit 132 acquires the waveform of む in the utterance information. Then, the generation unit 132 increases or decreases the pitch and amplitude of the acquired waveform based on the pitch and volume at that time in the recitation information. By repeating this for the character at each time included in the recitation information, the generation unit 132 generates the reading data.
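The per-time loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the patent's signal processing: pitch is changed by naive linear-interpolation resampling (which also changes the segment length), volume is applied as a plain amplitude scale, and the recorded pitch stored alongside each waveform is a hypothetical field.

```python
from typing import Dict, List, Optional, Tuple

Frame = Tuple[Optional[str], float, float]  # (character, volume, pitch)


def resample(waveform: List[float], factor: float) -> List[float]:
    """Naive pitch change: factor > 1 raises the pitch and shortens the segment."""
    if not waveform or factor <= 0:
        return []
    n_out = max(1, int(len(waveform) / factor))
    last = len(waveform) - 1
    out = []
    for k in range(n_out):
        pos = k * factor
        i = int(pos)
        frac = pos - i
        a = waveform[min(i, last)]
        b = waveform[min(i + 1, last)]
        out.append(a + (b - a) * frac)
    return out


def synthesize(frames: List[Frame],
               utterance_info: Dict[str, Tuple[List[float], float]]) -> List[List[float]]:
    """Build one sample list per frame.

    utterance_info maps a character to (waveform, recorded_pitch) for the
    second speaker. Pauses and unknown characters yield an empty segment.
    """
    output = []
    for character, volume, pitch in frames:
        if character is None or character not in utterance_info:
            output.append([])
            continue
        waveform, recorded_pitch = utterance_info[character]
        shifted = resample(waveform, pitch / recorded_pitch)
        output.append([sample * volume for sample in shifted])
    return output


# Illustrative data only.
utterance_info = {"む": ([0.0, 1.0, 0.0, -1.0], 100.0)}
frames = [("む", 0.5, 100.0), (None, 0.0, 0.0), ("む", 1.0, 200.0)]
print(synthesize(frames, utterance_info)[0])  # [0.0, 0.5, 0.0, -0.5]
```

A practical implementation would more likely use a pitch-shifting method that preserves duration, so that the timing in the recitation information is kept.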

出力制御部１３３は、読上データを出力するよう制御する。出力制御部１３３は、一例として、読上データと読上データを音声出力する指示とを情報端末２に送信する。 The output control unit 133 controls the output of the reading data. As an example, the output control unit 133 transmits the reading data and an instruction to output the reading data as audio to the information terminal 2.

ところで、例えば第１話者と第２話者の声の音高が大きく異なる場合、第１話者の音高で第２話者の音色を再現すると、不自然な朗読となってしまう。そこで、データ処理装置１が第１話者と第２話者の音高の違いに基づいて出力する音高を調整するように構成されることで自然な朗読を提供することができる。 However, for example, if the pitch of the voices of the first and second speakers differs significantly, reproducing the timbre of the second speaker using the pitch of the first speaker will result in an unnatural reading. Therefore, the data processing device 1 can be configured to adjust the output pitch based on the difference in pitch between the first and second speakers, thereby providing a natural reading.

この場合、朗読情報記憶部１２１は、コンテンツにおいて基準となる音高を示す第１基準音高データをさらに関連付けた朗読情報を記憶する。図４に示す朗読情報のメタデータは朗読情報に関連付けて第１基準音高データを示す「基準音高」を含む。第１基準音高データは、一例として、第１話者が当該コンテンツを朗読した音声の音高の平均値又は中央値である。 In this case, the reading information storage unit 121 stores the reading information further associated with first reference pitch data indicating the reference pitch in the content. The metadata of the reading information shown in FIG. 4 includes "reference pitch" indicating the first reference pitch data in association with the reading information. The first reference pitch data is, as an example, the average or median pitch of the voice of the first speaker reading the content.

また、発声情報記憶部１２２は第２話者の基準となる音高を示す第２基準音高データをさらに記憶してもよい。第２基準音高データは、一例として、第２話者の音声を収録する際に記録した音高の平均値、中央値である。 The vocalization information storage unit 122 may further store second reference pitch data indicating the reference pitch of the second speaker. As an example, the second reference pitch data is the average or median of the pitches recorded when the voice of the second speaker was recorded.

生成部１３２は、朗読情報の時刻それぞれにおける文字に発声情報において対応する音色を、該時刻における音高と、第１基準音高データと第２基準音高データとの比に基づいて決定した音高で出力させる読上データを生成する。生成部１３２は、第１基準音高データと第２基準音高データとの比を算出する。そして、生成部１３２は、朗読データの各時刻における音高に算出した比を乗算することで、読上データの当該時刻において発声されるべき音高を決定する。データ処理装置１がこのように構成されることで、第１話者と第２話者の音高の差に鑑みた自然な抑揚をつけた朗読をすることができる。 The generation unit 132 generates reading data that causes the timbre in the vocalization information corresponding to the character at each time in the reading information to be output at a pitch determined from the pitch at that time and the ratio between the first reference pitch data and the second reference pitch data. The generation unit 132 calculates the ratio between the first reference pitch data and the second reference pitch data. The generation unit 132 then multiplies the pitch at each time in the reading information by the calculated ratio to determine the pitch to be uttered at that time in the reading data. With the data processing device 1 configured in this way, the reading can be given a natural intonation that takes into account the difference in pitch between the first and second speakers.
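The ratio-based adjustment can be illustrated with a short sketch. The text does not state which reference pitch is the numerator; the sketch assumes the ratio is the second reference pitch divided by the first, so that the first speaker's pitch contour is moved into the second speaker's range. The function name `adjust_pitches` is hypothetical.

```python
def adjust_pitches(reading_pitches, first_reference, second_reference):
    """Scale the first speaker's pitch contour into the second speaker's range.

    Assumption: ratio = second reference pitch / first reference pitch.
    """
    if first_reference <= 0:
        raise ValueError("first_reference must be positive")
    ratio = second_reference / first_reference
    # Multiply the pitch at each time by the ratio, as described in the text.
    return [pitch * ratio for pitch in reading_pitches]
```

For example, with a first reference pitch of 200 Hz and a second reference pitch of 100 Hz, a contour of 200 Hz and 250 Hz becomes 100 Hz and 125 Hz, which keeps the relative rise of the intonation.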

ところで、朗読を聴取するユーザに第２話者を視覚的に見せることで、ユーザに第２話者が実際に朗読しているように感じさせることができる。この場合、発声情報記憶部１２２は、第２話者に対応する画像データをさらに記憶する。図５に示す発声情報においては、該発声情報に含まれる音声を提供した第２話者を被写体として含む画像データである「話者画像」が関連付けられている。画像データは、例えば、静止画、動画、３次元画像、ＡＲ（Augmented Reality）画像、ＶＲ（Virtual Reality）画像である。 By visually showing the second speaker to the user listening to the reading, the user can feel as if the second speaker is actually reading. In this case, the utterance information storage unit 122 further stores image data corresponding to the second speaker. The utterance information shown in FIG. 5 is associated with a "speaker image," which is image data including, as a subject, the second speaker who provided the voice included in the utterance information. The image data is, for example, a still image, a video, a three-dimensional image, an AR (Augmented Reality) image, or a VR (Virtual Reality) image.

出力制御部１３３は、読上データを出力するよう制御している場合に第２話者に対応する画像を表示部に表示させるよう制御してもよい。生成部１３２は、選択された第２話者に対応する発声情報に関連付けられた画像データを取得する。出力制御部１３３は、取得した画像データを情報端末２の表示部に表示させる。出力制御部１３３は、一例として、画像データに含まれる人物の口元の画像が読上データの出力と連動して変化するように出力してもよい。このように構成されることで、ユーザは第２話者が実際に話しているような感覚を得ることができる。 While controlling the output of the reading data, the output control unit 133 may also cause an image corresponding to the second speaker to be displayed on a display unit. The generation unit 132 acquires the image data associated with the vocalization information corresponding to the selected second speaker. The output control unit 133 displays the acquired image data on the display unit of the information terminal 2. As an example, the output control unit 133 may output the image so that the mouth of the person in the image data moves in conjunction with the output of the reading data. This configuration gives the user the feeling that the second speaker is actually speaking.

［朗読情報の置換］ [Replacement of reading information]
朗読を読み上げる音色を発声情報に基づいて変化させる例について説明したが、朗読の内容に変化をつけることで、ユーザをより楽しませることができる。そこで、データ処理装置１は、朗読情報の一部を方言に置き換えて朗読させてもよいし、朗読の途中にフレーズを挿入するよう構成されてもよい。 The example above changes the timbre of the reading based on the vocalization information, but varying the content of the reading can entertain the user further. The data processing device 1 may therefore be configured to replace part of the reading information with a dialect before reading it aloud, or to insert phrases partway through the reading.

方言による朗読について説明する。この場合、操作受付部１３１は、一例として方言による朗読を行うか否かを示すフラグと、ユーザが選択した方言を含む操作情報を取得する。 Reading in a dialect will now be described. In this case, the operation acceptance unit 131 acquires, as an example, operation information that includes a flag indicating whether to read in a dialect and the dialect selected by the user.

そして、生成部１３２は、朗読情報に含まれる１以上の文字と１以上の文字に対応する音高とを、１以上の文字それぞれが構成する単語に方言情報において対応する方言に含まれる文字と方言に含まれる文字を発音するための音高とで置換した置換朗読情報をさらに生成し、生成した置換朗読情報の時刻それぞれにおける置換後の文字に発声情報において対応する音色と該時刻における音量と置換後の音高とからなる音として出力させる読上データを生成する。操作受付部１３１が取得したユーザの操作内容が方言による朗読を選択したことを示す場合、生成部１３２は、方言情報をさらに取得する。そして、生成部１３２は、取得した朗読情報の単語と、方言情報に含まれる単語と、を比較し、合致する単語を方言情報に含まれる方言に置換する。一例として、朗読情報に含まれる「むかし」の単語を朗読情報において対応する方言である「んかし」に置換する。 The generation unit 132 further generates replacement reading information in which one or more characters included in the reading information, and the pitches corresponding to those characters, are replaced with the characters of the dialect form that corresponds in the dialect information to the word made up of those characters, and with the pitches for pronouncing those dialect characters. The generation unit 132 then generates reading data that causes the replaced character at each time in the replacement reading information to be output as a sound composed of the corresponding timbre in the vocalization information, the volume at that time, and the replaced pitch. If the user's operation acquired by the operation acceptance unit 131 indicates that reading in a dialect has been selected, the generation unit 132 additionally acquires the dialect information. The generation unit 132 then compares the words in the acquired reading information with the words in the dialect information and replaces each matching word with the dialect form given in the dialect information. As an example, the word "mukashi" (むかし) in the reading information is replaced with "nkashi" (んかし), the corresponding dialect form.

そして、置換した箇所の音高を方言情報の音高で出力させる読上データを生成する。朗読情報に含まれる単語と、方言情報において対応する方言の文字数が一致しない場合、一例として、当該単語を読み上げる時間が一致するように単語と方言とを構成する文字数の比に応じて各文字を読み上げる時間を短縮又は延長させてもよい。 The generation unit 132 then generates reading data in which the replaced portion is output at the pitch given in the dialect information. If the number of characters in a word in the reading information differs from that of the corresponding dialect form, then, as an example, the time for reading each character may be shortened or lengthened according to the ratio of the two character counts so that the total time for reading the word is unchanged.
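The retiming rule for dialect forms of a different length can be sketched as below. The sketch assumes that every character of the original word has the same duration, which the text does not state; the function name `retime_dialect` and all pitch values used with it are hypothetical.

```python
def retime_dialect(start, char_duration, word, dialect, dialect_pitches):
    """Replace `word` with `dialect` while keeping the total reading time of the word.

    Returns a list of (time, character, pitch, duration) tuples.
    Assumption: each character of the original word lasts `char_duration` seconds.
    """
    if len(dialect) != len(dialect_pitches):
        raise ValueError("one pitch is required per dialect character")
    # Scale the per-character duration by the ratio of the character counts.
    new_duration = char_duration * len(word) / len(dialect)
    return [
        (start + index * new_duration, char, pitch, new_duration)
        for index, (char, pitch) in enumerate(zip(dialect, dialect_pitches))
    ]
```

When "むかし" is replaced by "んかし" the character counts match and the timing is unchanged; a four-character word replaced by a two-character form would have each character held twice as long.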

データ処理装置１が朗読情報に含まれる単語の一部を方言に置き換えた読上データを生成するよう構成されることで、例えば、第２話者が方言を話す場合において、の第２話者本来の話し方に近い読上げを行うことが可能となる。 By configuring the data processing device 1 to generate reading data in which some of the words included in the reading information are replaced with a dialect, for example, when the second speaker speaks in a dialect, it becomes possible to perform reading that is close to the second speaker's original speaking style.

次に、朗読の途中にフレーズを挿入させる例について説明する。生成部１３２は、フラグが示すタイミングに複数の所定のフレーズから選択したフレーズを出力させる読上データを生成させる。操作受付部１３１が受け付けたユーザの操作内容が所定のフレーズを挿入して朗読することを示す場合、生成部１３２は、朗読情報におけるフレーズを挿入するタイミングを示すフラグが付与されているタイミングに所定のフレーズに対応する音を第２話者の音色で出力させる読上データを生成する。記憶部１２は、所定のフレーズとして、例えば、「すごいね」、「面白いね」等の感想を伝えるフレーズや、「大丈夫かな？」「この後どうなるのかな？」等のような展開を予測させたりするフレーズを記憶している。なお、生成部１３２は、所定のフレーズからランダムに選択されたフレーズをフラグが付与されたタイミングに挿入した読上データを生成してもよい。 Next, an example of inserting a phrase partway through a reading will be described. The generation unit 132 generates reading data that causes a phrase selected from a plurality of predetermined phrases to be output at the timing indicated by a flag. When the user's operation accepted by the operation acceptance unit 131 indicates that predetermined phrases are to be inserted into the reading, the generation unit 132 generates reading data that outputs the sound of a predetermined phrase in the timbre of the second speaker at each timing in the reading information that carries a flag marking a phrase insertion point. As the predetermined phrases, the storage unit 12 stores, for example, phrases that convey impressions, such as "That's amazing" or "That's interesting", and phrases that invite the listener to anticipate what comes next, such as "Is everything okay?" or "I wonder what will happen next?". The generation unit 132 may generate reading data in which a phrase selected at random from the predetermined phrases is inserted at the flagged timing.

フレーズは例えば、「～かしら」、「～だぜ」のような口癖であってもよい。この場合、朗読情報においてフラグが付与されているタイミングに口癖を示すフレーズを挿入してもよい。また、朗読情報においてフラグが付与されているタイミングに対応する文字を、口癖を示すフレーズで置換してもよい。 The phrase may be, for example, a verbal habit such as the sentence endings "~kashira" (～かしら) or "~daze" (～だぜ). In this case, the phrase representing the verbal habit may be inserted at the flagged timing in the reading information. Alternatively, the characters at the flagged timing in the reading information may be replaced with the phrase representing the verbal habit.

朗読中のコンテンツの場面に適したフレーズが挿入されるようにデータ処理装置１が構成されてもよい。すなわち、生成部１３２は、朗読情報に付されたコンテンツの場面に応じて挿入すべきフレーズに対応するフレーズをフラグが付与されたタイミングに挿入する。この場合、各フレーズにはフレーズに対応する感情が関連付けられている。一例として、生成部１３２は、コンテンツの場面が明るい場面である場合は、「楽しいね」などの明るい感情を表すフレーズが挿入されてもよいし、コンテンツの場面が危機に陥っている状況である場合は、「大丈夫かな？」などの心配する感情を表すフレーズを挿入する。 The data processing device 1 may be configured so that a phrase suited to the scene of the content being read is inserted. That is, at the flagged timing, the generation unit 132 inserts a phrase that matches the scene information attached to the reading information. In this case, each phrase is associated with an emotion. As an example, the generation unit 132 may insert a phrase expressing a cheerful emotion, such as "This is fun", when the scene is a bright one, and a phrase expressing worry, such as "Is everything okay?", when the scene depicts a crisis.
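Flag-driven phrase insertion with scene-dependent selection could look like the following sketch. The dictionary keys `"bright"` and `"worried"`, the entry fields `insert_flag` and `scene`, and the choice to place the phrase directly after the flagged entry are all assumptions made for illustration; the phrases themselves are the examples given in the text.

```python
import random

# Phrases grouped by the emotion they express (grouping keys are assumptions).
PHRASES = {
    "bright": ["楽しいね"],
    "worried": ["大丈夫かな？"],
}


def pick_phrase(scene, phrases=PHRASES, rng=random):
    """Pick a phrase matching the scene, or any phrase at random if the scene is unknown."""
    candidates = phrases.get(scene)
    if not candidates:
        candidates = [p for group in phrases.values() for p in group]
    return rng.choice(candidates)


def insert_phrases(reading_info, phrases=PHRASES, rng=random):
    """Return the sequence of text items to be voiced, with phrases added at flagged entries."""
    output = []
    for entry in reading_info:
        output.append(entry["char"])
        if entry.get("insert_flag"):
            # Assumption: the phrase is voiced right after the flagged entry.
            output.append(pick_phrase(entry.get("scene"), phrases, rng))
    return output
```

Each inserted phrase would then be synthesized in the second speaker's timbre in the same way as the surrounding characters.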

データ処理装置１がこのように構成されることで、変化をつけた朗読を出力させることが可能となり、ユーザをより楽しませることができる。 By configuring the data processing device 1 in this way, it becomes possible to output a varied reading, which can provide greater enjoyment to the user.

［ユーザの状況に応じた制御］ [Control according to user's situation]

コンテンツの朗読を聴取するユーザの状況に基づいて読上げの出力を制御してもよい。このように構成することで、例えば、子どもの入眠への導入としてデータ処理システムＳを用いて読み聞かせをする利用シーンにおいて、子どもが入眠した場合に読上げを停止したり、音量を徐々に小さくさせながら停止させたりすることができる。 The output of the reading may be controlled based on the situation of the user listening to the reading of the content. By configuring it in this way, for example, in a usage scenario in which the data processing system S is used to read to a child to help them fall asleep, the reading can be stopped when the child falls asleep, or the volume can be gradually reduced before stopping.

撮像データ取得部１３４は、ユーザを撮像した撮像データを取得する。撮像データ取得部１３４は、情報端末２の撮像手段が撮像した撮像データを取得する。判定部１３５は、撮像データ取得部１３４から取得した撮像データを既知の画像解析技術を用いて画像解析することでユーザの状態を判定する。判定部１３５は、一例として、取得した撮像データを画像認識することでユーザの感情や、朗読に集中しているかどうか、ユーザが感じている眠気の状態又や眠っているか否か等を判定してもよい。 The imaging data acquisition unit 134 acquires imaging data of the user. The imaging data acquisition unit 134 acquires imaging data captured by the imaging means of the information terminal 2. The determination unit 135 determines the state of the user by performing image analysis of the imaging data acquired from the imaging data acquisition unit 134 using a known image analysis technique. As an example, the determination unit 135 may determine the user's emotions, whether the user is concentrating on the recitation, the drowsiness the user is feeling, or whether the user is asleep, etc., by performing image recognition on the acquired imaging data.

出力制御部１３３は、判定部１３５が判定したユーザの状況に基づいて出力の態様を制御する。一例として、判定部１３５がユーザの状態をユーザが眠っていると判定した場合に、読上データの出力を停止し、又は読上データの出力態様を変更する。出力制御部１３３は、判定部１３５が、ユーザが眠っていると判定した場合又は眠気を感じていると判定した場合に、読上データを出力する音量を下げるように制御してもよいし、読上データの出力を停止させるよう制御してもよい。 The output control unit 133 controls the output mode based on the user's situation determined by the determination unit 135. As an example, when the determination unit 135 determines that the user is asleep, the output control unit 133 stops the output of the reading data or changes its output mode. When the determination unit 135 determines that the user is asleep or feeling drowsy, the output control unit 133 may lower the volume at which the reading data is output, or may stop the output of the reading data.
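One possible policy for the volume control described above is sketched here. The state labels `"asleep"` and `"drowsy"`, the integer volume scale of 0 to 100, and the step size are assumptions; the text only says that the volume may be lowered or the output stopped.

```python
def control_output(volume, user_state, step=10):
    """Return (new_volume, keep_playing) for one control cycle.

    Assumed policy: fade out and stop when the user is asleep,
    lower the volume (but keep playing) when the user is drowsy.
    """
    if user_state == "asleep":
        new_volume = max(0, volume - step)
        # Stop the output once the fade reaches silence.
        return new_volume, new_volume > 0
    if user_state == "drowsy":
        # Lower the volume but never below one step, so the reading stays audible.
        return max(step, volume - step), True
    return volume, True
```

Calling this once per determination cycle lowers the volume gradually, which corresponds to the bedtime scenario in which the reading fades out after the child falls asleep.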

出力制御部１３３は、判定部１３５が判定したユーザの状況に基づいて読上げるコンテンツを他のコンテンツに切り替えてもよい。出力制御部１３３は、一例として、判定部１３５がユーザの状態をユーザが朗読に関心を示していないことを判定した場合、音楽などの他のコンテンツを出力させるよう制御してもよい。また、生成部１３２に他のコンテンツの朗読情報と選択された発声情報とから読上データを生成させ、読上データを出力するよう制御してもよい。 The output control unit 133 may switch the content to be read to other content based on the user's status determined by the determination unit 135. As an example, when the determination unit 135 determines that the user's status indicates that the user is not interested in reading aloud, the output control unit 133 may control the output of other content such as music. In addition, the output control unit 133 may control the generation unit 132 to generate reading data from the reading information of the other content and the selected vocalization information, and output the reading data.

コンテンツを聴取しているユーザに所定の状況が発生した場合に、ユーザの関係者に所定の状況が生じたことを通知するようデータ処理装置１が構成されてもよい。データ処理装置１は、例えば、朗読を聴取するユーザＵの保護者やユーザＵを介護する介護者に通知してもよい。 The data processing device 1 may be configured to notify related parties of a user who is listening to the content that a specific situation has occurred when the specific situation occurs. For example, the data processing device 1 may notify a guardian of the user U who is listening to the reading or a caregiver who is caring for the user U.

判定部１３５は、判定したユーザの状態を情報端末２に通知してもよい。判定部１３５は、例えば、取得した撮像データを画像解析した結果、ユーザが怒っている又は泣いている等の状態にあることを判定した場合に、所定の通知先へユーザの状態を通知するメッセージを送信する。 The determination unit 135 may notify the information terminal 2 of the determined state of the user. For example, when the determination unit 135 determines that the user is angry or crying as a result of image analysis of the acquired imaging data, the determination unit 135 transmits a message notifying a predetermined notification destination of the user's state.

ところで、朗読情報には第１話者が行った様々な個性のある朗読情報が記録されており、判定部１３５が判定したユーザの状況に基づいて適切な朗読情報を選択することで、ユーザはより朗読を楽しむことができる。 Incidentally, the stored reading information covers a variety of readings by first speakers, each with its own distinctive style, and selecting appropriate reading information based on the user's situation determined by the determination unit 135 lets the user enjoy the reading more.

判定部１３５は、ユーザの属性及びユーザの状態の少なくともいずれかに基づいてユーザの状況を判定する。例えば、予め登録されているユーザＵの年齢や性別等の属性に基づいてユーザの状況を判定してもよい。例えばユーザが幼児である場合、「盛り上がる」朗読情報を選択することが適切であると判定してもよい。また、既に説明したように判定部１３５は、取得した撮像データを画像認識することで、ユーザの状況を判定してもよい。 The determination unit 135 determines the user's situation based on at least one of the user's attributes and the user's state. For example, the user's situation may be determined based on attributes such as the age and gender of the user U that have been registered in advance. For example, if the user is a young child, it may be determined that it is appropriate to select "exciting" reading information. Furthermore, as already described, the determination unit 135 may determine the user's situation by performing image recognition on the acquired imaging data.

生成部１３２は、判定部１３５が判定したユーザの状況に関連付けられた朗読情報を選択し、選択した朗読情報の時刻それぞれにおける文字に発声情報において対応する音色と該時刻における音量と音高とからなる音として出力させる読上データを生成する。生成部１３２は、判定部１３５が判定したユーザの状況と合致する状況に関連付けられた朗読情報を取得し、取得した朗読情報と発声情報とに基づいて読上データを生成する。 The generation unit 132 selects the reading information associated with the user's situation determined by the determination unit 135, and generates reading data that causes the character at each time in the selected reading information to be output as a sound composed of the corresponding timbre in the vocalization information and the volume and pitch at that time. The generation unit 132 acquires the reading information associated with a situation that matches the user's situation determined by the determination unit 135, and generates the reading data based on the acquired reading information and the vocalization information.

［ユーザとのインタラクション］ [User interaction]
ユーザの反応に基づいて出力を制御する例について説明したが、データ処理装置１がユーザの発話内容に対応する応答をするよう制御するよう構成されてもよい。 An example of controlling the output based on the user's reaction has been described above; the data processing device 1 may also be configured to respond to the content of the user's utterance.

この場合、朗読情報記憶部１２１は、朗読情報において朗読の対象となるコンテンツに含まれる言葉と言葉が示す意味とを対応付けた辞書情報を朗読情報と関連付けてさらに記憶する。図７は、朗読情報記憶部１２１が記憶する辞書情報のデータ構造の一例を示す図である。辞書情報においては、「コンテンツ」と「単語」と「意味」が関連付けられている。「コンテンツ」は単語に対応する意味が一般的な意味であるか、特定のコンテンツにおける意味であるかを示す。例えば、「コンテンツ」が「一般」である場合は、一般的な意味を指し、コンテンツを識別する情報（例えば「ももたろう」）が格納されている場合、そのコンテンツ特有の意味であることを示す。 In this case, the reading information storage unit 121 further stores, in association with the reading information, dictionary information that associates words included in the content to be read in the reading information with the meanings that the words indicate. FIG. 7 is a diagram showing an example of the data structure of dictionary information stored in the reading information storage unit 121. In the dictionary information, "content", "word", and "meaning" are associated. "Content" indicates whether the meaning corresponding to the word is a general meaning or a meaning in a specific content. For example, when "content" is "general", it indicates a general meaning, and when information that identifies the content (for example, "Momotaro") is stored, it indicates a meaning specific to that content.

音声情報取得部１３６は、コンテンツを視聴するユーザが発話した音声情報を取得する。音声情報取得部１３６は、情報端末２に搭載されたマイクが検出したユーザが発話した音声を示す音声情報を取得する。 The voice information acquisition unit 136 acquires voice information uttered by a user viewing the content. The voice information acquisition unit 136 acquires voice information indicating the voice uttered by the user detected by a microphone mounted on the information terminal 2.

音声認識部１３７は、音声情報取得部１３６が取得した音声情報を音声認識し、ユーザの発話内容を取得する。音声認識部１３７は、取得した発話内容を既知の自然言語処理技術を用いて形態素解析、構文解析及び意味解析を行い、ユーザの発話内容を分類する。音声認識部１３７は、一例として、発話内容を「質問」、「感情の表現」等に分類する。分類された発話内容が質問の場合、音声認識部１３７は、取得した発話内容を解析して質問されている内容を特定する。例えば、ユーザの発話内容が「黍団子って何？」である場合、質問内容が黍団子の意味であることを特定する。 The voice recognition unit 137 recognizes the voice information acquired by the voice information acquisition unit 136 and acquires the user's speech content. The voice recognition unit 137 performs morphological analysis, syntactic analysis, and semantic analysis on the acquired speech content using known natural language processing technology, and classifies the user's speech content. As an example, the voice recognition unit 137 classifies the speech content into "question", "expression of emotion", etc. If the classified speech content is a question, the voice recognition unit 137 analyzes the acquired speech content to identify the content being asked. For example, if the user's utterance content is "What is millet dumpling?", it identifies that the question is about the meaning of millet dumpling.

生成部１３２は、音声認識部１３７が取得したユーザの発話内容がコンテンツに対する質問である場合に、辞書情報を参照し、質問に対する回答を示す回答情報を生成し、出力制御部１３３は、回答情報を出力するよう制御する。生成部１３２は、ユーザの発話内容がコンテンツに対する質問である場合、辞書情報を検索し、音声認識部１３７が特定した質問の内容に対応する意味を取得する。生成部１３２は、辞書情報に記録された一般的な意味と朗読データを出力しているコンテンツ特有の意味とを検索対象として検索する。 When the user's utterance acquired by the voice recognition unit 137 is a question about the content, the generation unit 132 refers to the dictionary information and generates answer information indicating the answer to the question, and the output control unit 133 controls the output of the answer information. Specifically, the generation unit 132 searches the dictionary information and acquires the meaning corresponding to the question identified by the voice recognition unit 137. The search covers both the general meanings recorded in the dictionary information and the meanings specific to the content whose reading data is being output.

生成部１３２は、所定のフォーマットに取得した意味を当てはめることで回答文を生成する。そして、生成部１３２は、生成した回答文を第２話者に対応する発声情報で読上げる回答情報を生成し、情報端末２に出力する。 The generation unit 132 generates a response sentence by applying the acquired meaning to a predetermined format. The generation unit 132 then generates response information that reads the generated response sentence aloud using vocalization information corresponding to the second speaker, and outputs the generated response information to the information terminal 2.
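The dictionary lookup and template-based answer can be sketched as follows. The sample rows are illustrative and are not taken from FIG. 7; the preference for a content-specific meaning over a general one, the answer template, and the names `find_meaning` and `build_answer` are assumptions.

```python
# Illustrative dictionary rows with the three associated fields named in the text.
SAMPLE_DICTIONARY = [
    {"content": "general", "word": "黍団子", "meaning": "a dumpling made from millet"},
    {"content": "ももたろう", "word": "黍団子",
     "meaning": "the food Momotaro shares with his companions"},
]


def find_meaning(word, current_content, dictionary):
    """Search content-specific meanings first, then general ones (assumed priority)."""
    for scope in (current_content, "general"):
        for row in dictionary:
            if row["content"] == scope and row["word"] == word:
                return row["meaning"]
    return None


def build_answer(word, meaning, template="{word} means {meaning}."):
    """Fill the acquired meaning into a predetermined answer format."""
    return template.format(word=word, meaning=meaning)
```

The resulting sentence would then be synthesized with the second speaker's vocalization information so that the answer is heard in the same voice as the reading.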

さらに、ユーザが朗読に対してどのような反応をしたかを記録するよう構成されてもよい。音声認識部１３７は、ユーザの発話内容と、発話内容をユーザが発話したタイミングに対応する時刻と、を関連付けた反応情報を反応情報記憶部１２４に記憶させる。音声認識部１３７は、ユーザが発話した際に朗読していたコンテンツを識別する情報をさらに関連付けた反応情報を反応情報記憶部１２４に記憶させてもよい。データ処理装置１がこのように構成されることで、ユーザの朗読に対する反応を記録し、ユーザの思い出を残すことができる。 The data processing device 1 may further be configured to record how the user responded to the reading. The voice recognition unit 137 stores in the reaction information storage unit 124 reaction information that associates the content of the user's utterance with the time corresponding to when the user spoke the content. The voice recognition unit 137 may store in the reaction information storage unit 124 reaction information that is further associated with information that identifies the content that was being read when the user spoke. By configuring the data processing device 1 in this way, it is possible to record the user's reaction to the reading and preserve the user's memories.

［データ処理装置１における処理の流れ］ [Processing flow in data processing device 1]
図８は、データ処理装置１における処理の流れを示すフローチャートである。図８に示すフローチャートは、朗読情報の選択を受け付ける準備ができた時点から開始している。操作受付部１３１は、第２話者と朗読対象のコンテンツとの選択を情報端末２から受け付ける（Ｓ１０１）。生成部１３２は、選択されたコンテンツに対応する朗読情報を朗読情報記憶部１２１から取得する（Ｓ１０２）。生成部１３２は、選択された第２話者に対応する発声情報を発声情報記憶部１２２から取得する（Ｓ１０３）。 FIG. 8 is a flowchart showing the flow of processing in the data processing device 1. The flowchart in FIG. 8 starts at the point when the device is ready to accept a selection of reading information. The operation acceptance unit 131 accepts the selection of the second speaker and of the content to be read from the information terminal 2 (S101). The generation unit 132 acquires the reading information corresponding to the selected content from the reading information storage unit 121 (S102). The generation unit 132 acquires the vocalization information corresponding to the selected second speaker from the vocalization information storage unit 122 (S103).

生成部１３２は、方言の選択を受付けたかを判定する（Ｓ１０４）。方言の選択を受付けた場合（Ｓ１０４におけるＹＥＳ）、生成部１３２は、朗読情報に含まれる単語を対応する方言に置換する（Ｓ１０５）。方言の選択を受付けていない場合（Ｓ１０４におけるＮＯ）、置換する処理をスキップする。 The generation unit 132 determines whether a dialect selection has been accepted (S104). If a dialect selection has been accepted (YES in S104), the generation unit 132 replaces the words included in the reading information with the corresponding dialect (S105). If a dialect selection has not been accepted (NO in S104), the generation unit 132 skips the replacement process.

生成部１３２は、朗読情報と発声情報とに基づいて読上データを生成する（Ｓ１０６）。そして、出力制御部１３３は、生成した読上データを情報端末２に出力するよう制御する（Ｓ１０７）。そして、データ処理装置１は、処理を終了する。 The generation unit 132 generates reading data based on the reading information and the vocalization information (S106). Then, the output control unit 133 controls the generated reading data to be output to the information terminal 2 (S107). Then, the data processing device 1 ends the process.
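Steps S101 to S107 can be summarized in a short control-flow sketch. For brevity the reading information is modeled as a plain list of words and the stores as dictionaries; `run_reading`, its parameters, and the `synthesize` and `send` callables are hypothetical stand-ins for the generation unit and the output control unit.

```python
def run_reading(selection, reading_store, vocalization_store, dialect_store,
                synthesize, send):
    """Follow the S101-S107 flow for one playback request (`selection` is the S101 input)."""
    reading_info = list(reading_store[selection["content"]])      # S102
    vocalization_info = vocalization_store[selection["speaker"]]  # S103

    dialect = selection.get("dialect")                            # S104
    if dialect is not None:
        table = dialect_store[dialect]
        # S105: replace each word that has a dialect form, keep the others.
        reading_info = [table.get(word, word) for word in reading_info]

    reading_data = synthesize(reading_info, vocalization_info)    # S106
    send(reading_data)                                            # S107
    return reading_data
```

When no dialect is selected, the replacement step is skipped and the reading information is passed to synthesis unchanged, matching the NO branch of S104.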

以上記載したようにデータ処理装置１が構成されることで、他の朗読者の間や抑揚のある朗読を所望する音声で再現したテキストの読み上げを出力することができる。 By configuring the data processing device 1 as described above, it is possible to output a reading of the text that reproduces, in a desired voice, another reader's recitation with its pauses and intonation.

なお、本発明により、国連が主導する持続可能な開発目標（SDGs）の目標９「産業と技術革新の基盤をつくろう」に貢献することが可能となる。 Furthermore, this invention makes it possible to contribute to Goal 9, "Industry, Innovation and Infrastructure", of the Sustainable Development Goals (SDGs) led by the United Nations.

以上、本発明を実施の形態を用いて説明したが、本発明の技術的範囲は上記実施の形態に記載の範囲には限定されず、その要旨の範囲内で種々の変形及び変更が可能である。例えば、装置の全部又は一部は、任意の単位で機能的又は物理的に分散・統合して構成することができる。また、複数の実施の形態の任意の組み合わせによって生じる新たな実施の形態も、本発明の実施の形態に含まれる。組み合わせによって生じる新たな実施の形態の効果は、もとの実施の形態の効果を併せ持つ。 Although the present invention has been described above using embodiments, the technical scope of the present invention is not limited to the scope described in those embodiments, and various modifications and changes are possible within the scope of its gist. For example, all or part of the device can be functionally or physically distributed or integrated in any unit. New embodiments produced by any combination of the embodiments are also included in the embodiments of the present invention. A new embodiment produced by such a combination has the effects of the original embodiments it combines.

Reference Signs List
１ データ処理装置 Data processing device
２ 情報端末 Information terminal
１１ 通信部 Communication unit
１２ 記憶部 Storage unit
１３ 制御部 Control unit
１２１ 朗読情報記憶部 Reading information storage unit
１２２ 発声情報記憶部 Vocalization information storage unit
１２３ 方言情報記憶部 Dialect information storage unit
１２４ 反応情報記憶部 Reaction information storage unit
１３１ 操作受付部 Operation acceptance unit
１３２ 生成部 Generation unit
１３３ 出力制御部 Output control unit
１３４ 撮像データ取得部 Imaging data acquisition unit
１３５ 判定部 Determination unit
１３６ 音声情報取得部 Voice information acquisition unit
１３７ 音声認識部 Voice recognition unit

Claims

a reading information storage unit that stores reading information which is time-series data indicating features of the voice of a first speaker reading aloud content consisting of character strings, the reading information associating a time in the playback time of the content with a character uttered at the time, a volume, and a pitch;
an utterance information storage unit that stores utterance information generated by sampling a voice uttered by a second speaker, the utterance information associating a plurality of characters with a timbre of the voice uttered when pronouncing each of the plurality of characters;
a generating unit that generates reading data for outputting the character at each of the times of the reading information as a sound consisting of the timbre corresponding to that character in the utterance information and the volume and the pitch at the time;
an output control unit that controls the reading data to be output;
A data processing device comprising:

the reading information storage unit stores the reading information further associated with first reference pitch data indicating a reference pitch in the content;
the utterance information storage unit further stores second reference pitch data indicating a reference pitch of the second speaker;
the generating unit generates the reading data for outputting the timbre in the utterance information corresponding to the character at each of the times of the reading information at a pitch determined based on the pitch at the time and a ratio between the first reference pitch data and the second reference pitch data.
2. A data processing apparatus according to claim 1.

a dialect information storage unit that stores dialect information that associates a plurality of words, a dialect corresponding to each of the plurality of words, one or more characters that constitute the dialect, and a pitch for pronouncing the character that constitutes the dialect,
The reading information storage unit stores the reading information further associated with a word constituting the reading information and one or more characters constituting the word,
The generating unit further generates replacement reading information in which the one or more characters and the pitch corresponding to the one or more characters included in the reading information are replaced with a character included in the dialect corresponding, in the dialect information, to the word constituted by the one or more characters and with a pitch for pronouncing the character included in the dialect, and generates the reading data for outputting the replaced character at each of the times of the generated replacement reading information as a sound consisting of the timbre corresponding to that character in the utterance information, the volume at the time, and the replaced pitch.
3. A data processing device according to claim 1 or 2.

the reading information storage unit stores the reading information in which the time is associated with a flag indicating a timing for inserting a phrase;
the generating unit generates the reading data for outputting a phrase selected from a plurality of predetermined phrases at a timing indicated by the flag;
A data processing device according to any one of claims 1 to 3.

the utterance information storage unit further stores image data corresponding to the second speaker;
the output control unit controls a display unit to display an image corresponding to the second speaker when the output control unit controls the reading data to be output;
A data processing device according to any one of claims 1 to 4.

an imaging data acquisition unit that acquires imaging data of an image of a user;
A determination unit that determines a state of the user by performing image analysis on the imaging data acquired from the imaging data acquisition unit,
The output control unit stops output of the reading data or changes an output mode of the reading data when the determination unit determines that the user is asleep.
A data processing device according to any one of claims 1 to 5.

an imaging data acquisition unit that acquires imaging data of an image of a user;
a determination unit that determines a state of the user by performing image analysis on the imaging data acquired from the imaging data acquisition unit, and notifies an information terminal of the determined state of the user;
A data processing apparatus according to any one of claims 1 to 5, further comprising:

The reading information storage unit stores a plurality of pieces of reading information in which the content is read by a plurality of different first speakers, in association with a situation in which each piece of reading information is suitable;
The determination unit determines a situation of the user based on at least one of an attribute of the user and a state of the user;
The generation unit selects the reading information associated with the user's situation determined by the determination unit, and generates the reading data for outputting the character at each of the times of the selected reading information as a sound consisting of the timbre corresponding to that character in the utterance information and the volume and the pitch at the time.
8. A data processing device according to claim 6 or 7.
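A minimal sketch of the situation-based selection described above, under stated assumptions: the situation labels, the rule that a drowsy state takes priority over the age group, and the recitation identifiers are all hypothetical and are not given in the source.

```python
from typing import Optional

# Hypothetical mapping from a situation to the recitation suited to it.
RECITATION_BY_SITUATION = {
    "winding_down": "recitation_soft_voice",
    "young_listener": "recitation_slow_pace",
    "default": "recitation_standard",
}


def determine_situation(age_group: Optional[str] = None,
                        state: Optional[str] = None) -> str:
    """Derive a situation from at least one of an attribute and a state."""
    if state == "drowsy":          # assumed rule: state takes priority
        return "winding_down"
    if age_group == "toddler":     # assumed rule based on a user attribute
        return "young_listener"
    return "default"


def select_recitation(situation: str) -> str:
    """Pick the recitation associated with the situation, with a fallback."""
    return RECITATION_BY_SITUATION.get(situation, RECITATION_BY_SITUATION["default"])


chosen = select_recitation(determine_situation(age_group="toddler", state="drowsy"))
```

The selected recitation would then be combined with the second speaker's timbre in the same way as any other recitation.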

The data processing device according to any one of claims 1 to 5, wherein:
the recitation information storage unit further stores, in association with the recitation information, dictionary information that associates words included in the content recited in the recitation information with the meanings of those words in the recitation information;
the data processing device further comprises a voice information acquisition unit that acquires voice information uttered by a user who is viewing the content, and a voice recognition unit that recognizes the voice information acquired by the voice information acquisition unit and acquires the content of the user's utterance;
the generation unit, when the content of the user's utterance acquired by the voice recognition unit is a question about the content, refers to the dictionary information and generates answer information indicating an answer to the question; and
the output control unit controls output of the answer information.
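The question-answering behavior above could be sketched as follows, assuming the voice recognition step has already produced text. The question pattern ("what does X mean"), the function name `answer_question`, and the dictionary entry are hypothetical. The source does not specify how a question is detected.

```python
import re
from typing import Dict, Optional


def answer_question(utterance: str, dictionary: Dict[str, str]) -> Optional[str]:
    """Return answer text if the utterance asks for a word's meaning, else None."""
    # Assumed question form; a real system would need broader intent detection.
    match = re.fullmatch(r"what does (.+?) mean\??", utterance.strip().lower())
    if match is None:
        return None
    word = match.group(1)
    meaning = dictionary.get(word)
    if meaning is None:
        # The word is not covered by the dictionary information.
        return None
    return f"{word} means {meaning}"


dictionary_info = {"ogre": "a frightening giant in folk tales"}
answer = answer_question("What does ogre mean?", dictionary_info)
not_a_question = answer_question("I like this story", dictionary_info)
```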

The data processing device according to claim 9, further comprising a reaction information storage unit, wherein the voice recognition unit stores, in the reaction information storage unit, reaction information that associates the content of an utterance by the user with the time corresponding to the timing at which the user made the utterance.
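One way to picture the reaction information is a list of (playback time, utterance) pairs. The sketch below assumes times are seconds from the start of content playback. The class `ReactionStore` and its `between` query are hypothetical helpers, not part of the source.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ReactionStore:
    """Minimal stand-in for the reaction information storage unit."""
    records: List[Tuple[float, str]] = field(default_factory=list)

    def store(self, playback_time: float, utterance: str) -> None:
        # Associate the utterance content with the playback time at which it was made.
        self.records.append((playback_time, utterance))

    def between(self, start: float, end: float) -> List[str]:
        # Utterances made while playback was within [start, end).
        return [text for t, text in self.records if start <= t < end]


reactions = ReactionStore()
reactions.store(12.5, "What does ogre mean?")
reactions.store(47.0, "Read that part again")
early = reactions.between(0.0, 30.0)
```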

A data processing method executed by a computer, comprising the steps of:
acquiring recitation information stored in a recitation information storage unit, the recitation information being time-series data indicating features of the voice of a first speaker reciting content composed of character strings, and associating a time in the playback time of the content with a character uttered at that time, a volume, and a pitch;
acquiring vocalization information stored in a vocalization information storage unit, the vocalization information being generated by sampling a voice uttered by a second speaker, and associating a plurality of characters with a timbre of the voice uttered when pronouncing each of the plurality of characters;
generating reading data for outputting the character at each of the times in the recitation information as a sound having the timbre corresponding to that character in the vocalization information and the volume and the pitch at that time; and
controlling output of the reading data.
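The generating step can be pictured as a join between the first speaker's time series and the second speaker's per-character timbre. The following is a sketch under stated assumptions, not the claimed implementation: timbre is represented by a placeholder string, and characters with no sampled timbre are skipped, a case the source does not address. All class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RecitationFrame:
    time: float      # seconds from the start of content playback
    character: str   # character uttered by the first speaker at this time
    volume: float
    pitch: float


@dataclass(frozen=True)
class ReadingFrame:
    time: float
    timbre: str      # placeholder for the second speaker's sampled timbre
    volume: float    # taken from the first speaker's recitation
    pitch: float     # taken from the first speaker's recitation


def generate_reading_data(recitation: List[RecitationFrame],
                          timbre_by_character: Dict[str, str]) -> List[ReadingFrame]:
    """Combine each frame's volume and pitch with the timbre for its character."""
    reading = []
    for frame in recitation:
        timbre = timbre_by_character.get(frame.character)
        if timbre is None:
            # Assumption: skip characters the second speaker never sampled.
            continue
        reading.append(ReadingFrame(frame.time, timbre, frame.volume, frame.pitch))
    return reading


recitation = [
    RecitationFrame(0.0, "a", 0.6, 220.0),
    RecitationFrame(0.4, "i", 0.8, 247.0),
    RecitationFrame(0.8, "?", 0.5, 200.0),
]
timbres = {"a": "timbre_a", "i": "timbre_i"}
reading_data = generate_reading_data(recitation, timbres)
```

Timing, volume, and pitch come from the first speaker, while only the timbre comes from the second speaker, which is what lets the output keep the original reader's pauses and intonation.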

A program that causes a computer to execute the steps of:
acquiring recitation information stored in a recitation information storage unit, the recitation information being time-series data indicating features of the voice of a first speaker reciting content composed of character strings, and associating a time in the playback time of the content with a character uttered at that time, a volume, and a pitch;
acquiring vocalization information stored in a vocalization information storage unit, the vocalization information being generated by sampling a voice uttered by a second speaker, and associating a plurality of characters with a timbre of the voice uttered when pronouncing each of the plurality of characters;
generating reading data for outputting the character at each of the times in the recitation information as a sound having the timbre corresponding to that character in the vocalization information and the volume and the pitch at that time; and
controlling output of the reading data.