
WO2019059094A1 - Speech processing method and speech processing device - Google Patents

Speech processing method and speech processing device

Info

Publication number
WO2019059094A1
WO2019059094A1 (PCT/JP2018/034010)
Authority
WO
WIPO (PCT)
Prior art keywords
prosody
voice
speech
response
change
Prior art date
Application number
PCT/JP2018/034010
Other languages
French (fr)
Japanese (ja)
Inventor
嘉山 啓
Original Assignee
Yamaha Corporation (ヤマハ株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yamaha Corporation
Publication of WO2019059094A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 — Speech synthesis; Text to speech systems
    • G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 — Prosody rules derived from text; Stress or intonation
    • G10L15/00 — Speech recognition
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • the present invention relates to a technique suitable for speech interaction.
  • Patent Document 1 discloses a technique for analyzing the content of an utterance by speech recognition of the user's utterance voice, and synthesizing and reproducing a response voice according to the analysis result.
  • In view of such circumstances, the present invention aims to realize a natural voice dialogue.
  • A voice processing method according to a preferred aspect specifies the feature amount of a first voice represented by a first voice signal for each pronunciation period, and generates a second voice signal representing a second voice whose feature amount corresponds to the change of the feature amount of the first voice over a plurality of pronunciation periods.
  • A voice processing device according to a preferred aspect includes a voice analysis unit that specifies the feature amount of the first voice represented by the first voice signal for each pronunciation period, and a response generation unit that generates a second voice signal representing a second voice whose feature amount corresponds to the change of the feature amount of the first voice over a plurality of pronunciation periods.
  • FIG. 1 is a block diagram of the voice interactive apparatus 100 according to the first embodiment of the present invention.
  • the voice interactive apparatus 100 according to the first embodiment is a computer system that reproduces a voice (hereinafter referred to as “response voice”) Vy that responds to an input voice (hereinafter referred to as “speech voice”) Vx pronounced by the user U.
  • a portable information processing apparatus such as a cellular phone or a smartphone, or an information processing apparatus such as a personal computer is used as the voice interaction apparatus 100.
  • The speech voice Vx is, for example, the voice of an utterance including a question or a remark, and the response voice Vy is the voice of a response including an answer to the question or an acknowledgment of the remark.
  • The response voice Vy also includes, for example, voices representing interjections. An interjection is an uninflected independent word (exclamation) used independently of other segments. Specific examples include backchannel words such as "un" and "ee" ("aha" or "right" in English), hesitation fillers such as "eto" and "ano" ("um" or "er" in English), answers (affirmation or negation of a question) such as "hai" and "iie" ("yes" or "no" in English), expressions of the speaker's emotion such as "aa" and "oo" ("ah" or "woo" in English), and words meaning a question back (re-asking) such as "e?" and "nani?" ("pardon?" or "sorry?" in English).
  • the voice interaction device 100 is a voice processing device that generates a response voice Vy (example of a second voice) of a feature amount according to the feature amount of the speech voice Vx (example of a first voice).
  • The feature amount is, for example, prosody.
  • Prosody is a linguistic and phonetic characteristic that a listener of the voice can perceive but that cannot be grasped from the general written representation of the language alone (for example, a representation excluding special notation for prosody).
  • Prosody can also be described as a characteristic that allows the listener to recall or infer the speaker's intention or emotion.
  • Various features are included in the concept of prosody: intonation (change or inflection of the tone of the voice), tone (the height or intensity of the voice), sound length (utterance length), speech speed, rhythm (the structure of temporal change in tone), and accent (pitch or stress accent).
  • Typical examples of prosody are pitch (fundamental frequency) and volume.
  • the voice interaction device 100 includes a control device 20, a storage device 22, a voice input device 24, and a reproduction device 26.
  • The voice input device 24 is an element that generates a voice signal X (hereinafter "speech signal") representing the speech voice Vx of the user U, and includes a sound collection device 242 and an A/D converter 244.
  • the sound collection device 242 picks up the speech voice Vx (an example of the first speech signal) uttered by the user U and generates an analog speech signal representing the sound pressure fluctuation of the speech voice Vx.
  • the A / D converter 244 converts the audio signal generated by the sound collection device 242 into a digital speech signal X.
  • the control device 20 is an arithmetic processing unit (for example, a CPU) that comprehensively controls each element of the voice interaction device 100.
  • the control device 20 acquires the speech signal X supplied from the speech input device 24, and generates a response signal Y (exemplary second speech signal) representing a response speech Vy to the speech speech Vx.
  • the reproduction device 26 is an element that reproduces the response voice Vy according to the response signal Y generated by the control device 20, and includes a D / A converter 262 and a sound emission device 264.
  • The D/A converter 262 converts the digital response signal Y generated by the control device 20 into an analog voice signal, and the sound emitting device 264 (for example, a speaker or headphones) emits the response voice Vy as sound waves according to the converted voice signal.
  • the reproduction device 26 also includes a processing circuit such as an amplifier for amplifying the response signal Y.
  • the speech signal X and the response signal Y are, for example, audio data in wav format.
  • the storage device 22 stores a program executed by the control device 20 and various data used by the control device 20.
  • a known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of a plurality of recording mediums is arbitrarily adopted as the storage device 22.
  • the storage device 22 according to the first embodiment stores an audio signal Z representing a response voice of specific utterance content.
  • The following description exemplifies a case where a voice signal Z of a response voice such as "un", which represents a backchannel and is an example of an interjection, is stored in the storage device 22.
  • the audio signal Z is recorded in advance and stored in the storage device 22 as audio data of an arbitrary format such as wav format.
  • the control device 20 implements a plurality of functions (the voice analysis unit 34 and the response generation unit 36) for establishing a voice dialogue with the user U by executing the program stored in the storage device 22.
  • a configuration in which the functions of control device 20 are realized by a plurality of devices (i.e., systems) or a configuration in which a dedicated electronic circuit realizes a part of the functions of control device 20 may be employed.
  • the speech analysis unit 34 specifies the prosody Px of the speech voice Vx from the speech signal X generated by the speech input device 24.
  • the prosody Px is an acoustic feature that can be extracted from the speech signal X.
  • the voice analysis unit 34 of the first embodiment sequentially specifies the prosody Px for each pronunciation period of the speech voice Vx.
  • the speech analysis unit 34 specifies the numerical value of the specific type of prosody Px required by the program being executed among the plurality of types.
  • One pronunciation period is a continuous period that is grasped as one utterance by the user U (for example, a question or a remark), such as a period during which the volume of the speech voice Vx continuously exceeds a predetermined threshold.
  • A speech period corresponding to one response may also be defined as a pronunciation period.
  • the speech analysis unit 34 specifies a representative value (for example, an average value) of a plurality of prosody specified at a predetermined cycle in the sound generation period as the prosody Px of the sound generation period.
  • the prosody at a specific time (for example, an end point) within the pronunciation period may be specified as the prosody Px of the pronunciation period.
  • the prosody Px may be specified from the point in time immediately before the last phoneme of the utterance voice Vx during the pronunciation period.
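As an illustration of this per-period analysis, the following is a minimal sketch in Python, assuming a NumPy-based pipeline; the frame length, hop size, and volume threshold are illustrative values, not taken from the patent.

```python
import numpy as np

def frame_rms(signal, frame_len=1024, hop=512):
    """Per-frame RMS volume of a mono signal (used to find pronunciation periods)."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

def pronunciation_periods(signal, threshold=0.02, frame_len=1024, hop=512):
    """Spans (start_frame, end_frame) where the volume stays above the threshold."""
    voiced = frame_rms(signal, frame_len, hop) > threshold
    periods, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                      # a pronunciation period begins
        elif not v and start is not None:
            periods.append((start, i))     # the period ends when volume drops
            start = None
    if start is not None:
        periods.append((start, len(voiced)))
    return periods

def representative_prosody(per_frame_values):
    """Representative value (here the mean) of the per-frame prosody in one period."""
    return float(np.mean(per_frame_values))
```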
  • the response generator 36 generates a response signal Y representing the response voice Vy. Specifically, the response generation unit 36 generates a response signal Y representing the response speech Vy of the prosody Py corresponding to the temporal change of the prosody Px specified by the speech analysis unit 34.
  • The change in prosody Px is an example of the "change in feature amount". As described above, the prosody Px is specified for each pronunciation period, so the temporal change of the prosody Px means the change of the prosody Px between successive pronunciation periods, not the change of prosody within one pronunciation period.
  • the prosody Py is a feature of the same type as the prosody Px, but the numerical values are different.
  • the response generation unit 36 of the first embodiment generates a response signal Y by adjusting the prosody Pz of the audio signal Z stored in the storage device 22 to the prosody Py.
  • The response signal Y generated by the response generation unit 36 is supplied to the reproduction device 26, whereby the response voice Vy is reproduced. That is, a response voice Vy obtained by adjusting the initial response voice represented by the voice signal Z according to the prosody Px of the speech voice Vx is reproduced from the reproduction device 26.
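As one concrete reading of this adjustment, the sketch below takes volume as the prosody: the stored response signal Z with prosody Pz is rescaled so that its RMS volume becomes the target prosody Py. This is a hedged example; the patent does not prescribe a specific signal-processing method.

```python
import numpy as np

def adjust_volume(z, target_py):
    """Rescale the stored response signal z so its RMS volume (Pz) becomes Py."""
    pz = np.sqrt(np.mean(z ** 2))      # current prosody Pz: RMS volume of Z
    return z * (target_py / pz) if pz > 0 else z
```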
  • FIG. 2 is a flowchart of processing executed by the control device 20 according to the first embodiment.
  • the process of FIG. 2 is started in response to an instruction from the user U (for example, an instruction to start a program for speech interaction) to the speech interaction apparatus 100.
  • the speech analysis unit 34 analyzes the speech signal X generated by the speech input device 24 to specify the prosody Px for one pronunciation period Tx of the speech speech Vx (Sa1).
  • Although the prosody Px basically has its value determined at the end of the pronunciation period Tx, the value may also be determined at a point partway through the pronunciation period Tx.
  • FIG. 3 illustrates the prosody Px_n calculated for the n-th pronunciation period Tx_n of the utterance voice Vx (n is a natural number). That is, FIG. 3 is an explanatory view of a process performed when the utterance (for example, an inquiry or a talk) of the sound generation period Tx_n by the user U is completed.
  • A prosody change index Dx_n is calculated as the difference between the prosody Px_n of the latest pronunciation period Tx_n and the prosody Px_n−1 of the immediately preceding pronunciation period Tx_n−1 (Dx_n = Px_n − Px_n−1) (Sa2), and the response generation unit 36 then generates a response signal Y of the prosody Py corresponding to the prosody change index Dx_n (Sa3). Specifically, as illustrated in FIG. 3, the response generation unit 36 generates the response signal Y representing the response voice Vy of the prosody Py by changing the prosody Pz of the voice signal Z by the change amount Dy_n corresponding to the prosody change index Dx_n. Note that at the stage when the speech voice Vx of the first pronunciation period Tx_1 has been uttered, the difference in prosody Px cannot yet be calculated between two successive pronunciation periods Tx, so the change amount Dy_1 is set to a predetermined initial value.
  • The prosody change index Dx_n−1 is calculated in the same way, as the difference between the prosody Px_n−1 calculated for the pronunciation period Tx_n−1 of the speech voice Vx and the prosody Px_n−2 calculated for the immediately preceding pronunciation period Tx_n−2.
  • FIG. 4 is a graph showing the relationship between the prosody change index Dx and the change amount Dy (the difference between the prosody Pz and the prosody Py).
  • the graph of FIG. 4 corresponds to a rule for determining the change amount Dy from the prosody change index Dx.
  • the change amount Dy is determined so that the change amount Dy linearly increases with respect to the increase of the prosody change index Dx.
  • the change amount Dy is set to a numerical value equal to the prosody change index Dx.
  • When the prosody Px_n exceeds the prosody Px_n−1 (that is, when the prosody Px of the speech voice Vx has increased), the prosody Py of the response voice Vy is set to a value exceeding the prosody Pz of the voice signal Z. Conversely, when the prosody Px_n is less than the prosody Px_n−1 (that is, when the prosody Px of the speech voice Vx has decreased), the prosody Py of the response voice Vy is set to a value less than the prosody Pz of the voice signal Z.
  • The relationship between the prosody change index Dx and the change amount Dy is not limited to the above example. For example, as illustrated by the broken line in FIG. 4, the change amount Dy may vary non-linearly with respect to the prosody change index Dx, and the sum of the prosody change index Dx_n and an initial value may be calculated as the change amount Dy_n. That is, the relationship between the prosody change index Dx and the change amount Dy may be any relationship under which the prosody Py of the response voice Vy is a prosody suited to the prosody Px of the speech voice Vx (see the sketch below).
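A sketch of the FIG. 4 mapping from the prosody change index Dx to the change amount Dy. The identity rule (Dy = Dx) corresponds to the solid line; the gain and offset parameters stand in for the non-linear and initial-value variants and are assumptions, not values from the patent.

```python
def change_amount(dx, gain=1.0, offset=0.0):
    """Determine Dy from Dx; gain=1 and offset=0 gives the linear rule Dy = Dx."""
    return gain * dx + offset

# Example: the prosody rose by 12 units between two successive utterances.
dy_linear = change_amount(12.0)               # Dy = 12.0 (solid line in FIG. 4)
dy_offset = change_amount(12.0, offset=3.0)   # Dx plus an initial value
```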
  • In short, a change amount Dy is set that represents the degree to which the prosody Py of the response voice Vy is changed. That is, the change amount Dy for adjusting the prosody Py of the response voice Vy to be output immediately afterwards is set from the prosody change index Dx, which indicates the change of the prosody Px between successive utterances of the speech voice Vx.
  • The prosody Py of the response voice Vy set by the above method is the result of adjusting the prosody Pz of the voice signal Z so as to be in harmony with the utterance, such as a question or a remark.
  • the response generation unit 36 reproduces the response voice Vy by supplying the response signal Y generated by the above processing to the reproduction device 26 (Sa4).
  • the control device 20 determines whether the end of the voice dialogue has been instructed by the user U (Sa5). When the end of the voice dialogue is not instructed (Sa5: NO), the control device 20 shifts the processing to step Sa1.
  • The specification of the prosody Px of the speech voice Vx (Sa1), the calculation of the prosody change index Dx (Sa2), the generation of the response signal Y of the prosody Py according to the prosody change index Dx (Sa3), and the reproduction of the response voice Vy (Sa4) are repeated for every pronunciation period Tx of the speech voice Vx. That is, the processing from step Sa1 to step Sa4 is executed every time the user U utters the speech voice Vx (every time the speech signal X is input). Therefore, a voice dialogue is realized in which the utterance of an arbitrary speech voice Vx by the user U and the reproduction of the response voice Vy to that speech voice Vx alternate (a sketch of this loop follows below).
  • the processing from step Sa1 to step Sa4 is sequentially performed for each utterance period Tx during speech (input) by the user U, and corresponds to an operation of generating a response to one utterance voice Vx.
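A sketch of the loop of FIG. 2 (steps Sa1–Sa5), assuming that prosody values add (Py = Pz + Dy) and passing the I/O and DSP stages in as stubs; capture_utterance, analyse, synthesize, play, and end_requested are hypothetical callables standing in for the voice input device 24, the voice analysis unit 34, the response generation unit 36, and the reproduction device 26.

```python
def dialogue_loop(z, pz, analyse, synthesize, capture_utterance, play,
                  end_requested, dy_init=0.0):
    """One speech/response exchange per iteration, as in FIG. 2."""
    px_prev, dy = None, dy_init
    while not end_requested():                # Sa5: end of dialogue instructed?
        x = capture_utterance()               # speech signal X for one period Tx
        px = analyse(x)                       # Sa1: prosody Px of this period
        if px_prev is not None:
            dy = px - px_prev                 # Sa2: prosody change index Dx_n
        py = pz + dy                          # Sa3: target prosody Py
        play(synthesize(z, py))               # Sa3/Sa4: generate and reproduce Vy
        px_prev = px
```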
  • the response signal Y representing the response speech Vy of the prosody Py according to the temporal change of the prosody Px of the speech speech Vx is generated. That is, the prosody Py of the response speech Vy changes in conjunction with the prosody Px of the speech speech Vx. Therefore, it is possible to realize a natural speech dialogue that simulates the tendency of a real dialogue in which the prosody of the response voice of the dialogue partner is interlocked with the change of the prosody of the speech.
  • The first example of the prosody Px and the prosody Py is pitch (fundamental frequency).
  • When the pitch of the speech voice Vx rises over successive utterances, the pitch of the response voice Vy rises in conjunction with that rise.
  • The second example of the prosody Px and the prosody Py is volume.
  • When the volume of the speech voice Vx increases with time, the volume of the response voice Vy increases in conjunction with the increase.
  • the third example of the prosody Px and the prosody Py is speech speed.
  • Speaking speed means the speed of speech. For example, the number of phonemes included in speech within a unit time corresponds to the speech speed.
  • When the speech speed of the speech voice Vx rises, the speech speed of the response voice Vy rises in conjunction with the rise.
  • the fourth example of the prosody Px and the prosody Py is a spectrum width.
  • the spectrum width is, for example, the difference between the maximum value and the minimum value in the envelope (spectral envelope) of the frequency spectrum of speech.
  • the fifth example of the prosody Px and the prosody Py is the pitch range.
  • the pitch range is the fluctuation range of the pitch within the sound generation period (that is, the difference between the maximum value and the minimum value of the pitch within the sound generation period).
  • the sixth example of the prosody Px and the prosody Py is the volume width.
  • the sound volume width is the fluctuation range of the sound volume within the sound generation period (that is, the difference between the maximum value and the minimum value of the sound volume within the sound generation period).
  • When the volume width of the speech voice Vx increases with time, the volume width of the response voice Vy increases in conjunction with the increase.
  • the pitch range and the volume range correspond to the intonation (tone) of the sound. Therefore, in the fifth and sixth examples, the intonation of the response voice Vy changes in conjunction with the change of intonation in the speech voice Vx.
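The per-period quantities named in the third through sixth examples can be computed directly from per-frame pitch and volume tracks; a minimal sketch follows, in which the crude log-spectrum stand-in for the spectral envelope is an assumption.

```python
import numpy as np

def speech_speed(num_phonemes, duration_sec):
    """Third example: phonemes uttered per unit time."""
    return num_phonemes / duration_sec

def spectrum_width(frame):
    """Fourth example: max-min difference of the log magnitude spectrum
    (a crude stand-in for the spectral envelope)."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    log_mag = 20 * np.log10(mag + 1e-12)
    return float(np.max(log_mag) - np.min(log_mag))

def pitch_range(per_frame_pitches):
    """Fifth example: pitch fluctuation range within one pronunciation period."""
    return float(np.max(per_frame_pitches) - np.min(per_frame_pitches))

def volume_width(per_frame_volumes):
    """Sixth example: volume fluctuation range within one pronunciation period."""
    return float(np.max(per_frame_volumes) - np.min(per_frame_volumes))
```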
  • the seventh example of the prosody Px and the prosody Py is the speech interval.
  • the speech interval is an interval between two successive sounding periods in the voice dialogue (a time length from the end of the front sounding period to the start of the rear sounding period).
  • the interval between the sound generation period Tx of the speech voice Vx and the sound generation period Ty of the response sound Vy corresponds to the sound generation interval.
  • It is assumed that the pronunciation interval between the (n−2)-th pronunciation period Ty_n−2 of the response voice Vy and the (n−1)-th pronunciation period Tx_n−1 of the speech voice Vx is specified as the prosody Px_n−1, and that the pronunciation interval between the (n−1)-th pronunciation period Ty_n−1 of the response voice Vy and the n-th pronunciation period Tx_n of the speech voice Vx is specified as the prosody Px_n.
  • the prosody change index Dx_n is calculated as a time length corresponding to the difference between the prosody Px_n and the prosody Px_n-1.
  • the response generation unit 36 generates the response signal Y so that the sound generation period Ty_n of the response voice Vy starts when the change amount Dy_n corresponding to the prosody change index Dx_n elapses from the end point of the sound generation period Tx_n. That is, the variation Dy_n is applied as the prosody Py_n (pronunciation interval) of the response voice Vy.
  • the change amount Dy_n may be calculated according to the prosody change index Dx_n (that is, the difference between the prosody Px_n and the prosody Px_n-1) and a predetermined initial value. For example, the addition value of the prosody change index Dx_n and the initial value may be calculated as the change amount Dy_n.
  • In this way, a response signal Y representing the response voice Vy of the prosody Py corresponding to the change of the prosody Px of the speech voice Vx (the prosody change index Dx_n) is generated.
  • Likewise, the pronunciation interval between the earlier pronunciation periods is set according to the prosody change index Dx_n−1 determined by the same procedure. At the beginning of the voice dialogue, at the stage where the difference in prosody Px cannot yet be calculated between two successive pronunciation periods Tx, the change amount Dy is set to a predetermined initial value.
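A sketch of the seventh example, in which the pronunciation interval itself is the prosody: reproduction of the response is delayed by the change amount Dy_n after the end of the period Tx_n. time.sleep is an illustrative stand-in for a real-time scheduler.

```python
import time

def schedule_response(play, response_signal, dy_n):
    """Start reproducing Vy once Dy_n seconds have elapsed after Tx_n ends."""
    time.sleep(max(0.0, dy_n))   # wait out the pronunciation interval Py_n = Dy_n
    play(response_signal)
```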
  • the eighth example of the prosody Px and the prosody Py is the time length of the pronunciation period (hereinafter referred to as "speech length").
  • the utterance length is the time from the start to the end of the sound generation period.
  • It is assumed that the time length of the (n−1)-th pronunciation period Tx_n−1 of the speech voice Vx is specified as the prosody Px_n−1, and that the time length of the n-th pronunciation period Tx_n is specified as the prosody Px_n.
  • the prosody change index Dx_n is calculated as a time length corresponding to the difference between the prosody Px_n and the prosody Px_n-1.
  • The prosody change index Dx_n−1 is calculated in the same way, as the difference between the prosody Px_n−1 calculated for the pronunciation period Tx_n−1 of the speech voice Vx and the prosody Px_n−2 calculated for the immediately preceding pronunciation period Tx_n−2.
  • the response generation unit 36 sets the response signal Y so that the prosody Py_n (that is, the utterance length) of the response voice Vy with respect to the utterance voice Vx in the pronunciation period Tx_n becomes a time length (change amount Dy_n) according to the prosody change index Dx_n.
  • the variation Dy_n is applied as the prosody Py_n of the response speech Vy.
  • the addition value of the prosody change index Dx_n and the initial value may be calculated as the change amount Dy_n.
  • In this way, a response signal Y representing the response voice Vy of the prosody Py corresponding to the change of the prosody Px of the speech voice Vx is generated.
  • At the beginning of the voice dialogue, at the stage where the difference in prosody Px cannot yet be calculated between two successive pronunciation periods Tx, the change amount Dy is set to a predetermined initial value.
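A sketch of the eighth example, in which the utterance length is the prosody: the stored response Z is stretched to the target length Dy_n by naive linear-interpolation resampling. Note this also shifts pitch; a real implementation would more likely use a time-scale modification algorithm, which is an assumption beyond the patent text.

```python
import numpy as np

def stretch_to_length(z, sample_rate, target_sec):
    """Resample z so that its duration becomes target_sec (the prosody Py_n)."""
    n_out = max(1, int(round(target_sec * sample_rate)))
    positions = np.linspace(0, len(z) - 1, n_out)   # where to sample the original
    return np.interp(positions, np.arange(len(z)), z)
```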
  • Second Embodiment A second embodiment of the present invention will be described.
  • For elements whose operation or function is the same as in the first embodiment, the reference symbols used in the description of the first embodiment are reused, and detailed description of each is omitted.
  • the response generation unit 36 of the first embodiment generates a response signal Y representing the response speech Vy of the prosody Py according to the temporal change of the prosody Px of the speech speech Vx.
  • The response generation unit 36 of the second embodiment generates a response signal Y representing the response voice Vy of the prosody Py corresponding to the value of the prosody Px of the speech voice Vx itself. That is, while the first embodiment controls the prosody Py of the response voice Vy according to a relative value of the prosody Px (that is, the prosody change index Dx), the second embodiment controls the prosody Py of the response voice Vy according to a single value of the prosody Px.
  • the response generation unit 36 generates the response signal Y by adjusting the prosody Pz of the audio signal Z stored in the storage device 22 to the prosody Py.
  • the prosody Py is a feature of the same type as the prosody Px, but the numerical values are different.
  • prosody Px and the prosody Py in the second embodiment are the same as in the first embodiment.
  • pitch, volume, speech speed, spectrum width, pitch width, volume width, speech interval and speech length are preferred examples of prosody Px and prosody Py.
  • An index of the tendency of temporal change of a prosody such as pitch or volume (for example, a change rate such as a rate of increase or decrease) may also be adopted as the prosody Px and the prosody Py.
  • FIG. 7 is a flowchart of processing executed by the control device 20 according to the second embodiment.
  • the process of FIG. 7 is started in response to an instruction from the user U (for example, an instruction to start a program for voice dialogue) to the voice dialogue apparatus 100.
  • the speech analysis unit 34 analyzes the speech signal X generated by the speech input device 24 to specify the prosody Px for one pronunciation period of the speech speech Vx (Sb1).
  • the response generation unit 36 generates a response signal Y of the prosody Py according to the prosody Px (Sb2). Specifically, the response generation unit 36 generates the response signal Y representing the response speech Vy of the prosody Py by changing the prosody Pz of the speech signal Z according to the prosody Py. Then, the response generation unit 36 reproduces the response voice Vy by supplying the response signal Y generated by the above processing to the reproduction device 26 (Sb3).
  • The control device 20 determines whether the end of the voice dialogue has been instructed by the user U (Sb4).
  • When the end of the voice dialogue has not been instructed, the process returns to step Sb1. That is, the specification of the prosody Px of the speech voice Vx (Sb1), the generation of the response signal Y of the prosody Py according to the prosody Px (Sb2), and the reproduction of the response voice Vy (Sb3) are repeated for every pronunciation period Tx. Therefore, as in the first embodiment, a voice dialogue is realized in which the utterance of an arbitrary speech voice Vx by the user U and the reproduction of the response voice Vy to that speech voice Vx alternate.
  • As described above, in the second embodiment, a response signal Y representing the response voice Vy of the prosody Py corresponding to the prosody Px of the speech voice Vx is generated. Therefore, it is possible to realize a natural voice dialogue that simulates the tendency of real dialogue, in which the prosody of the response voice of the dialogue partner is linked to the prosody of the utterance.
  • the response signal Y representing the response voice Vy of the prosody Py is generated by changing the prosody Pz of the speech signal Z by the change amount Dy_n corresponding to the prosody change index Dx_n.
  • The prosody Py_n of the current response voice Vy may be set according to the prosody change index Dx_n and the prosody Py_n−1 of the immediately preceding response voice Vy.
  • In that case, the response generation unit 36 sets, as the prosody Py_n, a value obtained by changing the prosody Py_n−1 according to the prosody change index Dx_n; for example, a value obtained by adding the prosody change index Dx_n to the prosody Py_n−1 is set as the prosody Py_n. With this configuration as well, it is possible to generate a response signal Y representing the response voice Vy of the prosody Py according to the change of the prosody Px of the speech voice Vx.
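A minimal sketch of this variant: the current response prosody is derived from the previous response prosody rather than from the stored prosody Pz.

```python
def next_response_prosody(py_prev, dx_n):
    """Py_n obtained by changing Py_n-1 by the prosody change index Dx_n."""
    return py_prev + dx_n

# Example: the previous response was at 220.0 (e.g. Hz) and the user's prosody
# rose by 8.0 between utterances, so the next response rises with it.
py_n = next_response_prosody(220.0, 8.0)   # 228.0
```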
  • In each of the above embodiments, the prosody Py of the response voice Vy that is controlled is of the same type as the prosody Px of the speech voice Vx, but the prosody Px and the prosody Py may be feature amounts of different types.
  • For example, the volume (prosody Py) of the response voice Vy may be controlled according to the change of the pitch (prosody Px) of the speech voice Vx.
  • In each of the above embodiments, one type of prosody Py of the response voice Vy is controlled according to the prosody Px of the speech voice Vx, but plural types of prosody Py of the response voice Vy may be controlled according to one type of prosody Px of the speech voice Vx.
  • For example, two or more prosodies Py arbitrarily selected from pitch, volume, speech speed, spectrum width, pitch range, volume width, pronunciation interval, and utterance length are controlled according to one type of prosody Px of the speech voice Vx.
  • the combination (type and total number) of the prosody Py of the response speech Vy controlled according to the prosody Px is arbitrary.
  • the prosody Py of the response speech Vy may be controlled according to a plurality of prosody Px of the speech speech Vx.
  • For example, two or more prosodies Px arbitrarily selected from pitch, volume, speech speed, spectrum width, pitch range, volume width, pronunciation interval, and utterance length are specified from the speech voice Vx and used to control one type of prosody Py of the response voice Vy.
  • Plural types of prosody Py may be controlled according to plural types of prosody Px.
  • the combination (type and total number) of the prosody Px of the speech voice Vx applied to the control of the prosody Py of the response speech Vy is arbitrary.
  • In each of the above embodiments, the prosody Py of the response voice Vy is controlled according to the prosody Px of the speech voice Vx, but elements other than the prosody Px of the speech voice Vx may also be applied to the control of the prosody Py of the response voice Vy.
  • the prosody Py of the response voice Vy may be controlled according to the prosody Px of the speech voice Vx and the correction value (offset) set independently of the prosody Px.
  • the final prosody Py is calculated by adding the correction value to the provisional value set according to the prosody Px.
  • the correction value may be either a fixed value or a variable value.
  • the correction value may be decreased as the time of speech dialogue using the speech dialogue apparatus 100 is longer.
  • The prosody Py of the response voice Vy may be limited to a predetermined range. For example, when the provisional prosody value calculated according to the prosody Px of the speech voice Vx exceeds (or falls below) a predetermined threshold, the threshold is adopted as the prosody Py. With this configuration, it is possible to reduce the possibility that the prosody Py of the response voice Vy takes an abnormal value and the voice dialogue becomes unnatural. Alternatively, when the provisional prosody value calculated according to the prosody Px of the speech voice Vx exceeds (or falls below) a predetermined threshold, a response voice Vy representing a question back (re-asking) to the utterance may be generated.
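A sketch of the two range-limiting behaviours just described; the thresholds and the choice of a "pardon?" signal are illustrative assumptions.

```python
def limit_prosody(py_provisional, lo, hi):
    """Clamp the provisional Py to [lo, hi]; the crossed threshold becomes Py."""
    return min(max(py_provisional, lo), hi)

def choose_response(py_provisional, lo, hi, normal_signal, pardon_signal):
    """Variant: an out-of-range provisional Py triggers a re-asking response."""
    if py_provisional < lo or py_provisional > hi:
        return pardon_signal   # e.g. a stored "pardon?" voice
    return normal_signal
```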
  • the difference between the prosody Px_n in the pronunciation period Tx_n of the speech Vx and the prosody Px_n-1 in the immediately preceding pronunciation period Tx_n-1 is calculated as the prosody change index Dx_n.
  • the reference numerical value is not limited to the prosody Px_n-1 of the immediately preceding pronunciation period Tx_n-1.
  • a change in prosody Px_n with respect to the prosody Px in a sound generation period Tx (for example, two or more previous sound generation periods Tx) other than the last sound generation period Tx_n-1 may be calculated as the prosody change index Dx_n.
  • the prosody change index Dx_n may be calculated according to the change in the prosody Px over three or more pronunciation periods Tx.
  • the prosody change index Dx_n may be calculated according to a change in the current prosody Px_n with respect to a representative value (for example, an average value) of the prosody Px over a plurality of pronunciation periods Tx in the past.
  • the difference between the prosody Px_n related to the speech Vx and the prosody Px_n-1 is calculated as the prosody change index Dx_n, but the calculation method of the prosody change index Dx_n is not limited to the above example.
  • In each of the above embodiments, the prosody Py of the response voice Vy is controlled according to the difference (the prosody change index Dx_n) between the prosody Px_n of the pronunciation period Tx_n of the speech voice Vx and the prosody Px_n−1 of the immediately preceding pronunciation period Tx_n−1, but the variable reflected in the prosody Py of the response voice Vy is not limited to the prosody change index Dx_n.
  • the prosody Py_n of the current response speech Vy may be set according to the prosody change index Dx_n and the prosody Py_n-1 of the immediately preceding response speech Vy.
  • The prosody difference between a plurality of past response voices Vy (Py_n−2 − Py_n−1) may also be applied to the setting of the prosody Py_n of the response voice Vy together with the prosody change index Dx_n.
  • In each of the above embodiments, the response signal Y is generated from the voice signal Z stored in the storage device 22 and reproduced, but the response signal Y representing a response voice Vy of specific utterance content may also be synthesized by known speech synthesis technology.
  • For example, concatenative speech synthesis using speech segments, or speech synthesis using a statistical model such as a hidden Markov model, is preferably used.
  • the speech voice Vx and the response speech Vy are not limited to human speech. For example, it is also possible to use the vocalization of an animal as the utterance voice Vx and the response voice Vy.
  • Each of the above embodiments exemplifies a configuration in which the voice interaction device 100 includes the voice input device 24 and the reproduction device 26, but the voice input device 24 and the reproduction device 26 may also be installed in a voice input/output device separate from the voice interaction device 100.
  • the voice interaction device 100 is realized by a terminal device such as a mobile phone or a smart phone, for example, and the voice input / output device is realized by an electronic device such as an animal type toy or a robot.
  • the voice interaction device 100 and the voice input / output device can communicate wirelessly or by wire.
  • the speech signal X generated by the voice input device 24 of the voice input / output device is transmitted to the voice interaction device 100 wirelessly or by wire, and the response signal Y generated by the voice interaction device 100 is wirelessly or by wire It is sent to the playback device 26.
  • In each of the above embodiments, the voice interaction device 100 is realized by an information processing device such as a mobile phone or a personal computer, but part or all of the functions of the voice interaction device 100 may also be realized by a server device (a so-called cloud server). Specifically, the voice interaction device 100 is realized by a server device that communicates with a terminal device via a communication network such as a mobile communication network or the Internet. For example, the voice interaction device 100 receives the speech signal X generated by the voice input device 24 of the terminal device, and generates a response signal Y from the speech signal X by the configuration of each embodiment described above.
  • the voice interaction device 100 transmits the response signal Y generated from the speech signal X to the terminal device, and causes the reproduction device 26 of the terminal device to reproduce the response voice Vy.
  • the voice interaction device 100 is realized by a single device or a set of a plurality of devices (ie, a server system). Whether each function realized by the voice interaction device 100 is realized by the server device or the terminal device (allocation of functions) is optional.
  • In each of the above embodiments, a response voice Vy of specific utterance content (for example, a backchannel such as "un") is reproduced in response to the speech voice Vx, but the utterance content of the response voice Vy is not limited to this example.
  • For example, the utterance content of the speech voice Vx may be analyzed by speech recognition and morphological analysis of the speech signal X, and a response voice Vy with content appropriate to that utterance content may be selected from a plurality of candidates or synthesized, and reproduced by the reproduction device 26.
  • In each of the above embodiments, a response voice Vy of utterance content prepared in advance, independent of the speech voice Vx, is reproduced. At first glance one might conclude that natural dialogue cannot be established this way, but because the prosody of the response voice Vy is controlled in various ways as in the above embodiments, the user U can in fact perceive the feel of a natural dialogue such as occurs between humans. Furthermore, a configuration that performs neither speech recognition nor morphological analysis has the advantage that the processing delay and processing load of those processes can be reduced or eliminated.
  • the response signal Y of the response voice Vy is generated by adjusting the prosody Pz of the voice signal Z, but the method of generating the response signal Y is not limited to the above example.
  • For example, a plurality of voice signals Z with different prosody Pz may be stored in the storage device 22, and the voice signal Z whose prosody Pz is closest to the prosody value corresponding to the prosody change index Dx (hereinafter the "target value") may be selected from the plurality of voice signals Z and used to generate the response signal Y.
  • the response signal Y may be generated from two or more audio signals Z selected in the order in which the prosody Pz is close to the target value among the plurality of audio signals Z.
  • the response signal Y is generated by weighted sum or interpolation of two or more audio signals Z.
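A sketch of this modification, assuming the stored signals are NumPy arrays at the same sample rate: select the stored signal whose prosody Pz is nearest the target value, or blend the two nearest by a weighted sum (the distance-based weights are an illustrative choice).

```python
import numpy as np

def select_signal(signals, prosodies, target):
    """Pick the stored signal Z whose prosody Pz is closest to the target value."""
    i = int(np.argmin([abs(p - target) for p in prosodies]))
    return signals[i]

def interpolate_two(signals, prosodies, target):
    """Weighted sum of the two signals whose Pz are closest to the target
    (requires at least two stored signals)."""
    order = np.argsort([abs(p - target) for p in prosodies])[:2]
    i, j = int(order[0]), int(order[1])
    d_i, d_j = abs(prosodies[i] - target), abs(prosodies[j] - target)
    w_i = d_j / (d_i + d_j) if (d_i + d_j) > 0 else 1.0
    n = min(len(signals[i]), len(signals[j]))   # align lengths before summing
    return w_i * signals[i][:n] + (1.0 - w_i) * signals[j][:n]
```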
  • It is also possible to use the voice interaction device 100 exemplified in each of the above embodiments for the evaluation of actual human dialogue. For example, by comparing the prosody of a response voice observed in actual human dialogue (hereinafter "observed voice") with the prosody of the response voice Vy generated as in the above embodiments, the observed voice can be evaluated as appropriate when the two prosodies are similar, and as inappropriate when the two prosodies diverge.
  • the apparatus (interaction evaluation apparatus) which performs the evaluation illustrated above may be used for training of human dialogue.
  • the voice interaction device 100 exemplified in each of the above-described embodiments is realized by the cooperation of the control device 20 and the program for voice interaction as described above.
  • A program according to a first aspect of the present invention (for example, the first embodiment) causes a computer to execute a voice analysis process (Sa1) of specifying the feature amount of a first voice represented by a first voice signal for each pronunciation period, and a response generation process (Sa2 and Sa3) of generating a second voice signal representing a second voice whose feature amount corresponds to the change of the feature amount of the first voice over a plurality of pronunciation periods.
  • A program according to a second aspect of the present invention (for example, the second embodiment) causes a computer to execute a voice analysis process (Sb1) of specifying the feature amount of a first voice represented by a first voice signal, and a response generation process (Sb2) of generating a second voice signal representing a second voice whose feature amount corresponds to the feature amount of the first voice.
  • the program according to each of the above aspects is provided in a form stored in a computer readable recording medium and installed in the computer.
  • The recording medium is, for example, a non-transitory recording medium, preferably an optical recording medium (optical disc) such as a CD-ROM, but may be any known recording medium such as a semiconductor recording medium or a magnetic recording medium. Note that a "non-transitory recording medium" includes every computer-readable recording medium except a transient propagating signal, and does not exclude volatile recording media.
  • the program may be distributed to the computer in the form of distribution via a communication network.
  • In a preferred aspect, the computer specifies the feature amount of a first voice represented by a first voice signal for each pronunciation period, and generates a second voice signal representing a second voice whose feature amount corresponds to the change of the feature amount of the first voice over a plurality of pronunciation periods.
  • the second audio signal representing the second audio of the feature amount according to the change of the feature amount of the first audio is generated. Therefore, for example, it is possible to realize a natural voice dialogue that simulates the tendency of a real dialogue in which the feature quantity of the response voice of the dialogue partner is interlocked with the change of the feature quantity of the uttered voice.
  • In a preferred aspect, a voice processing device includes a voice analysis unit that specifies the feature amount of a first voice represented by a first voice signal for each pronunciation period, and a response generation unit that generates a second voice signal representing a second voice whose feature amount corresponds to the change of the feature amount of the first voice over a plurality of pronunciation periods.
  • the second audio signal representing the second audio of the feature amount according to the change of the feature amount of the first audio is generated. Therefore, for example, it is possible to realize a natural voice dialogue that simulates the tendency of a real dialogue in which the feature quantity of the response voice of the dialogue partner is interlocked with the change of the feature quantity of the uttered voice.
  • In a preferred aspect, the feature amount of the first voice, and the feature amount of the second voice corresponding to its change, are at least one of pitch, volume, speech speed, spectrum width (the fluctuation amount of the spectral envelope), the fluctuation range of pitch within a pronunciation period, the fluctuation range of volume within a pronunciation period, the interval between successive pronunciation periods, and the time length of a pronunciation period.
  • 100 ... voice interaction device, 20 ... control device, 22 ... storage device, 24 ... voice input device, 242 ... sound collection device, 244 ... A/D converter, 26 ... reproduction device, 262 ... D/A converter, 264 ... sound emitting device, 32 ... voice acquisition unit, 34 ... voice analysis unit, 36 ... response generation unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A speech processing device includes a speech analysis unit for identifying a feature amount of a first speech represented by a first speech signal for each utterance period, and a response generation unit for generating a second speech signal representing a second speech with a feature amount corresponding to a change in the feature amount of the first speech during a plurality of utterance periods.

Description

Voice processing method and voice processing device
The present invention relates to a technique suitable for voice dialogue.
Techniques for voice dialogue that realize a dialogue with a user by reproducing the voice of a response to the user's utterance (for example, an answer to a question) have conventionally been proposed. For example, Patent Document 1 discloses a technique for analyzing utterance content by speech recognition of the user's utterance voice, and synthesizing and reproducing a response voice according to the analysis result.
JP 2012-128440 A
However, with existing techniques, including that of Patent Document 1, it is in practice difficult to realize a natural voice dialogue that faithfully reflects the tendencies of real dialogue between humans, and there is a problem that the user may receive a mechanical and unnatural impression. In view of the above circumstances, the present invention aims to realize a natural voice dialogue.
To solve the above problem, a voice processing method according to a preferred aspect of the present invention specifies the feature amount of a first voice represented by a first voice signal for each pronunciation period, and generates a second voice signal representing a second voice whose feature amount corresponds to the change of the feature amount of the first voice over a plurality of pronunciation periods.
A voice processing device according to a preferred aspect of the present invention includes a voice analysis unit that specifies the feature amount of a first voice represented by a first voice signal for each pronunciation period, and a response generation unit that generates a second voice signal representing a second voice whose feature amount corresponds to the change of the feature amount of the first voice over a plurality of pronunciation periods.
FIG. 1 is a block diagram showing the configuration of the voice interaction device according to the first embodiment.
FIG. 2 is a flowchart showing the operation of the voice interaction device.
FIG. 3 is an explanatory diagram showing the relationship between the prosody of the speech voice and the prosody of the response voice.
FIG. 4 is a graph showing the relationship between the prosody change index and the change amount of the prosody of the response voice.
FIG. 5 is an explanatory diagram of the prosody of the response voice when the pronunciation interval is used as the prosody.
FIG. 6 is an explanatory diagram of the prosody of the response voice when the utterance length is used as the prosody.
FIG. 7 is a flowchart showing the operation of the voice interaction device in the second embodiment.
<First Embodiment>
FIG. 1 is a block diagram of the voice interaction device 100 according to the first embodiment of the present invention. The voice interaction device 100 of the first embodiment is a computer system that reproduces a voice Vy (hereinafter "response voice") that responds to an input voice Vx (hereinafter "speech voice") uttered by the user U. For example, a portable information processing device such as a mobile phone or a smartphone, or an information processing device such as a personal computer, is used as the voice interaction device 100. The voice interaction device 100 can also be realized in the form of a toy imitating the appearance of an animal or the like (for example, a doll such as a stuffed animal) or a robot.
The speech voice Vx is, for example, the voice of an utterance including a question or a remark, and the response voice Vy is the voice of a response including an answer to the question or an acknowledgment of the remark. The response voice Vy also includes, for example, voices representing interjections. An interjection is an uninflected independent word (exclamation) used independently of other segments. Specific examples include words expressing a backchannel to an utterance, such as "un" and "ee" ("aha" or "right" in English); words expressing hesitation (a stall in responding), such as "eto" and "ano" ("um" or "er" in English); words expressing an answer (affirmation or negation of a question), such as "hai" and "iie" ("yes" or "no" in English); words expressing the speaker's emotion, such as "aa" and "oo" ("ah" or "woo" in English); and words meaning a question back (re-asking) about an utterance, such as "e?" and "nani?" ("pardon?" or "sorry?" in English).
The voice interaction device 100 of the first embodiment is a voice processing device that generates a response voice Vy (an example of a second voice) whose feature amount corresponds to the feature amount of the speech voice Vx (an example of a first voice). The feature amount is, for example, prosody. Prosody is a linguistic and phonetic characteristic that a listener of the voice can perceive, and means a property that cannot be grasped from the general written representation of the language alone (for example, a representation excluding special notation for prosody). Prosody can also be described as a characteristic that allows the listener to recall or infer the speaker's intention or emotion. Specifically, various features such as intonation (change or inflection of the tone of the voice), tone (the height or intensity of the voice), sound length (utterance length), speech speed, rhythm (the structure of temporal change in tone), and accent (pitch or stress accent) are included in the concept of prosody, and typical examples of prosody are pitch (fundamental frequency) and volume.
As illustrated in FIG. 1, the voice interaction device 100 of the first embodiment includes a control device 20, a storage device 22, a voice input device 24, and a reproduction device 26. The voice input device 24 is an element that generates a voice signal X (hereinafter "speech signal") representing the speech voice Vx of the user U, and includes a sound collection device 242 and an A/D converter 244. The sound collection device 242 picks up the speech voice Vx uttered by the user U and generates an analog voice signal representing the sound pressure fluctuation of the speech voice Vx. The A/D converter 244 converts the voice signal generated by the sound collection device 242 into a digital speech signal X.
The control device 20 is an arithmetic processing unit (for example, a CPU) that comprehensively controls each element of the voice interaction device 100. The control device 20 of the first embodiment acquires the speech signal X supplied from the voice input device 24 and generates a response signal Y (an example of a second voice signal) representing a response voice Vy to the speech voice Vx. The reproduction device 26 is an element that reproduces the response voice Vy according to the response signal Y generated by the control device 20, and includes a D/A converter 262 and a sound emitting device 264. The D/A converter 262 converts the digital response signal Y generated by the control device 20 into an analog voice signal, and the sound emitting device 264 (for example, a speaker or headphones) emits the response voice Vy as sound waves according to the converted voice signal. The reproduction device 26 also includes a processing circuit such as an amplifier that amplifies the response signal Y. The speech signal X and the response signal Y are, for example, voice data in wav format.
The storage device 22 stores a program executed by the control device 20 and various data used by the control device 20. For example, a known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of a plurality of recording media, may be arbitrarily adopted as the storage device 22. The storage device 22 of the first embodiment stores a voice signal Z representing a response voice of specific utterance content. The following description exemplifies a case where a voice signal Z of a response voice such as "un", which represents a backchannel and is an example of an interjection, is stored in the storage device 22. The voice signal Z is recorded in advance and stored in the storage device 22 as voice data in an arbitrary format such as wav.
The control device 20 implements a plurality of functions (a voice analysis unit 34 and a response generation unit 36) for establishing a voice dialogue with the user U by executing the program stored in the storage device 22. A configuration in which the functions of the control device 20 are realized by a plurality of devices (that is, a system), or a configuration in which a dedicated electronic circuit realizes part of the functions of the control device 20, may also be adopted.
The voice analysis unit 34 specifies the prosody Px of the speech voice Vx from the speech signal X generated by the voice input device 24. The prosody Px is an acoustic feature amount that can be extracted from the speech signal X. The voice analysis unit 34 of the first embodiment sequentially specifies the prosody Px for each pronunciation period of the speech voice Vx. As described above, plural types of prosody can be specified for the speech voice Vx, and the voice analysis unit 34 specifies the value of the particular type of prosody Px that the program being executed requires. One pronunciation period is a continuous period grasped as one utterance by the user U (for example, a question or a remark), for example a period during which the volume of the speech voice Vx continuously exceeds a predetermined threshold. A speech period corresponding to one response may also be defined as a pronunciation period. Specifically, the voice analysis unit 34 specifies a representative value (for example, an average) of a plurality of prosody values computed at a predetermined cycle within a pronunciation period as the prosody Px of that pronunciation period. The prosody at a specific point (for example, the end point) of a pronunciation period may also be specified as the prosody Px of that period, and the prosody Px may be specified from the point immediately before the last phoneme of the speech voice Vx in the pronunciation period.
The response generation unit 36 generates a response signal Y representing the response voice Vy. Specifically, the response generation unit 36 generates a response signal Y representing a response voice Vy whose prosody Py accords with the temporal change of the prosody Px specified by the voice analysis unit 34. The change of the prosody Px is an example of a "change of the feature quantity". Since the prosody Px is specified for each pronunciation period as described above, the temporal change of the prosody Px means the change of the prosody Px between successive pronunciation periods, not a change of prosody within one pronunciation period. The prosody Py is a feature quantity of the same kind as the prosody Px, but its numerical value differs. The response generation unit 36 of the first embodiment generates the response signal Y by adjusting the prosody Pz of the audio signal Z stored in the storage device 22 to the prosody Py. The response signal Y generated by the response generation unit 36 is supplied to the playback device 26, whereby the response voice Vy is played back. That is, the response voice Vy, obtained by adjusting the initial response voice represented by the audio signal Z in accordance with the prosody Px of the utterance voice Vx, is played back from the playback device 26.
FIG. 2 is a flowchart of the processing executed by the control device 20 of the first embodiment. The processing of FIG. 2 is started, for example, in response to an instruction from the user U to the voice dialogue apparatus 100 (for example, an instruction to start the voice dialogue program). When the processing of FIG. 2 starts, the voice analysis unit 34 analyzes the utterance signal X generated by the voice input device 24, thereby specifying the prosody Px for one pronunciation period Tx of the utterance voice Vx (Sa1). The numerical value of the prosody Px is basically fixed at the end of the pronunciation period Tx, but it may instead be fixed at a point partway through the pronunciation period Tx. FIG. 3 shows the prosody Px_n calculated for the n-th pronunciation period Tx_n of the utterance voice Vx (n is a natural number). That is, FIG. 3 illustrates the processing executed at the stage where the user U's utterance in the pronunciation period Tx_n (for example, a question or a remark) has been completed.
The response generation unit 36 calculates an index Dx of the change of the prosody Px of the utterance voice Vx (hereinafter "prosody change index") (Sa2). Specifically, as illustrated in FIG. 3, the response generation unit 36 calculates, as the prosody change index Dx_n, the difference between the prosody Px_n calculated for the latest pronunciation period Tx_n of the utterance voice Vx and the prosody Px_n-1 calculated for the immediately preceding pronunciation period Tx_n-1 (Dx_n = Px_n - Px_n-1). That is, the prosody change index Dx_n is an index of the difference in prosody between two successive utterance voices Vx (the change in prosody between two successive utterances).
The response generation unit 36 generates a response signal Y of the prosody Py corresponding to the prosody change index Dx_n (Sa3). Specifically, as illustrated in FIG. 3, the response generation unit 36 generates the response signal Y representing the response voice Vy of the prosody Py by changing the prosody Pz of the audio signal Z by a change amount Dy_n corresponding to the prosody change index Dx_n. At the stage where the utterance voice Vx of the first pronunciation period Tx_1 has been pronounced, the difference of the prosody Px cannot yet be calculated between two successive pronunciation periods Tx, so the change amount Dy_1 is set to a predetermined initial value. The prosody change index Dx_n-1 is calculated by the same procedure as described above, from the difference between the prosody Px_n-1 calculated for the pronunciation period Tx_n-1 of the utterance voice Vx and the prosody Px_n-2 calculated for the immediately preceding pronunciation period Tx_n-2.
FIG. 4 is a graph showing the relation between the prosody change index Dx and the change amount Dy (the difference between the prosody Pz and the prosody Py). The graph of FIG. 4 corresponds to a rule for determining the change amount Dy from the prosody change index Dx. As exemplified by the solid line in FIG. 4, the change amount Dy is determined so that it increases linearly with an increase of the prosody change index Dx; for example, the change amount Dy is set to a value equal to the prosody change index Dx. Accordingly, when the prosody Px_n exceeds the prosody Px_n-1 (that is, when the prosody Px of the utterance voice Vx has increased), the prosody Py of the response voice Vy is set to a value exceeding the prosody Pz of the audio signal Z. Conversely, when the prosody Px_n falls below the prosody Px_n-1 (that is, when the prosody Px of the utterance voice Vx has decreased), the prosody Py of the response voice Vy is set to a value below the prosody Pz of the audio signal Z. The relation between the prosody change index Dx and the change amount Dy is not limited to this example. For instance, as exemplified by the broken line in FIG. 4, the change amount Dy may vary non-linearly with the prosody change index Dx, or the sum of the prosody change index Dx_n and an initial value may be calculated as the change amount Dy_n. That is, any relation between the prosody change index Dx and the change amount Dy suffices as long as the prosody Py of the response voice Vy becomes a prosody suited to the prosody Px of the utterance voice Vx.
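As a sketch of this rule (not a definitive implementation): the linear case below is the solid line of FIG. 4 with Dy equal to Dx, while the tanh curve is one hypothetical shape for the broken-line non-linear variant, whose exact form the text leaves open.

    import math

    def change_amount(dx, rule="linear", scale=1.0):
        # Fig. 4, solid line: Dy equal to Dx; the tanh curve is a hypothetical
        # stand-in for the broken-line non-linear mapping
        if rule == "linear":
            return dx
        return scale * math.tanh(dx / scale)

    def response_prosody(pz, px_prev, px_curr, rule="linear"):
        dx = px_curr - px_prev                 # prosody change index Dx_n
        return pz + change_amount(dx, rule)    # Py_n = Pz shifted by Dy_n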
As understood from the above description, the change amount Dy, which represents the degree to which the prosody Py of the response voice Vy is changed, is set under the predetermined rule exemplified in FIG. 4. That is, the change amount Dy for adjusting the prosody Py of the response voice Vy to be output immediately afterwards is set from the prosody change index Dx indicating the change of the prosody Px between successive utterance voices Vx. The prosody Py of the response voice Vy set by this method is the result of adjusting the prosody Pz of the audio signal Z so as to harmonize with utterances such as questions and remarks.
The response generation unit 36 plays back the response voice Vy by supplying the response signal Y generated by the above processing to the playback device 26 (Sa4). When playback of the response voice Vy is completed, the control device 20 determines whether the user U has instructed the end of the voice dialogue (Sa5). If the end of the voice dialogue has not been instructed (Sa5: NO), the control device 20 returns the processing to step Sa1. As understood from the above description, the specification of the prosody Px of the utterance voice Vx (Sa1), the calculation of the prosody change index Dx (Sa2), the generation of the response signal Y of the prosody Py corresponding to the prosody change index Dx (Sa3), and the playback of the response voice Vy (Sa4) are repeated for every pronunciation period Tx of the utterance voice Vx. That is, the processing from step Sa1 to step Sa4 is executed each time the user U pronounces an utterance voice Vx (each time an utterance signal X is input). A voice dialogue is thus realized in which the pronunciation of an arbitrary utterance voice Vx by the user U and the playback of the response voice Vy to that utterance voice Vx alternate repeatedly. The processing from step Sa1 to step Sa4 is executed sequentially, for each pronunciation period Tx, in response to the user U's utterance (input), and corresponds to the operation of generating a response to one utterance voice Vx.
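A minimal control-flow sketch of steps Sa1 through Sa5, under the assumption that every callable below (get_utterance, analyze, synthesize, play, end_requested) is a hypothetical stand-in for the corresponding device or unit (the voice input device 24, the units 34 and 36, and the playback device 26):

    def dialogue_loop(get_utterance, analyze, synthesize, play, end_requested,
                      initial_dy=0.0):
        # repeats Sa1-Sa4 for every pronunciation period until Sa5 ends it
        px_prev = None
        while True:
            x = get_utterance()                      # one pronunciation period Tx
            px = analyze(x)                          # Sa1: prosody Px
            # Sa2: change index Dx (the initial value stands in for Dy_1)
            dx = initial_dy if px_prev is None else px - px_prev
            play(synthesize(dx))                     # Sa3-Sa4: generate and play Vy
            px_prev = px
            if end_requested():                      # Sa5
                return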
As described above, in the first embodiment, the response signal Y representing the response voice Vy of the prosody Py corresponding to the temporal change of the prosody Px of the utterance voice Vx is generated. That is, the prosody Py of the response voice Vy changes in conjunction with the prosody Px of the utterance voice Vx. It is therefore possible to realize a natural voice dialogue simulating the tendency of real dialogue, in which the prosody of the dialogue partner's response voice tracks changes in the prosody of the uttered voice.
<Specific examples of the prosody Px and the prosody Py>
Specific examples of the prosody Px and the prosody Py in the first embodiment are described below; a sketch computing several of these features, including the interval scheduling of example (7), follows example (8).
(1) A first example of the prosody Px and the prosody Py is pitch (fundamental frequency). When the user U raises the pitch of the utterance voice Vx over time (that is, between two successive pronunciation periods Tx), the pitch of the response voice Vy to each utterance voice Vx rises in conjunction with that rise.
(2) A second example of the prosody Px and the prosody Py is volume. When the user U increases the volume of the utterance voice Vx over time, the volume of the response voice Vy increases in conjunction with that increase.
(3) A third example of the prosody Px and the prosody Py is speech rate. The speech rate means the speed of utterance; for example, the number of phonemes contained in the voice per unit time corresponds to the speech rate. When the user U raises the speech rate of the utterance voice Vx over time, the speech rate of the response voice Vy rises in conjunction with that rise.
(4) A fourth example of the prosody Px and the prosody Py is spectrum width. The spectrum width is, for example, the difference between the maximum and minimum values of the envelope (spectral envelope) of the frequency spectrum of the voice. When the user U pronounces so that the spectrum width of the utterance voice Vx increases over time, the spectrum width of the response voice Vy increases in conjunction with that increase.
(5) A fifth example of the prosody Px and the prosody Py is pitch range. The pitch range is the fluctuation range of the pitch within a pronunciation period (that is, the difference between the maximum and minimum pitch within the pronunciation period). When the user U increases the pitch range of the utterance voice Vx over time, the pitch range of the response voice Vy increases in conjunction with that increase.
(6) A sixth example of the prosody Px and the prosody Py is volume range. The volume range is the fluctuation range of the volume within a pronunciation period (that is, the difference between the maximum and minimum volume within the pronunciation period). When the user U increases the volume range of the utterance voice Vx over time, the volume range of the response voice Vy increases over time in conjunction with that increase. The pitch range and the volume range correspond to the intonation (tone) of the voice; in the fifth and sixth examples, therefore, the intonation of the response voice Vy changes in conjunction with changes of intonation in the utterance voice Vx.
(7) A seventh example of the prosody Px and the prosody Py is the utterance interval. The utterance interval is the interval between two successive pronunciation periods in the voice dialogue (the time length from the end point of the earlier pronunciation period to the start point of the later pronunciation period). In the first embodiment, the interval between a pronunciation period Tx of the utterance voice Vx and a pronunciation period Ty of the response voice Vy corresponds to this interval.
For example, as illustrated in FIG. 5, assume that the interval between the (n-2)-th pronunciation period Ty_n-2 of the response voice Vy and the (n-1)-th pronunciation period Tx_n-1 of the utterance voice Vx is specified as the prosody Px_n-1, and that the interval between the (n-1)-th pronunciation period Ty_n-1 of the response voice Vy and the n-th pronunciation period Tx_n of the utterance voice Vx is specified as the prosody Px_n. The prosody change index Dx_n is then calculated as the time length corresponding to the difference between the prosody Px_n and the prosody Px_n-1.
The response generation unit 36 generates the response signal Y so that the pronunciation period Ty_n of the response voice Vy starts when the change amount Dy_n corresponding to the prosody change index Dx_n has elapsed from the end point of the pronunciation period Tx_n. That is, the change amount Dy_n is applied as the prosody Py_n (utterance interval) of the response voice Vy. The change amount Dy_n may also be calculated from the prosody change index Dx_n (that is, the difference between the prosody Px_n and the prosody Px_n-1) together with a predetermined initial value; for example, the sum of the prosody change index Dx_n and the initial value may be calculated as the change amount Dy_n. As understood from the above description, even in the configuration in which the prosody Px and the prosody Py are utterance intervals, a response signal Y representing a response voice Vy of the prosody Py corresponding to the change of the prosody Px of the utterance voice Vx (the prosody change index Dx_n) is generated.
Although FIG. 5 focuses on the utterance interval between the pronunciation period Tx_n and the pronunciation period Ty_n, the utterance interval between the pronunciation period Tx_n-1 and the pronunciation period Ty_n-1 in FIG. 5 is set according to a prosody change index Dx_n-1 obtained by the same procedure as described above. At the beginning of the voice dialogue, at the stage where the difference of the prosody Px cannot yet be calculated between two successive pronunciation periods Tx, the change amount Dy is set to a predetermined initial value.
(8) An eighth example of the prosody Px and the prosody Py is the time length of the pronunciation period (hereinafter "utterance length"). The utterance length is the time from the start point to the end point of the pronunciation period. Specifically, as illustrated in FIG. 6, assume that the time length of the (n-1)-th pronunciation period Tx_n-1 of the utterance voice Vx is specified as the prosody Px_n-1 and the time length of the n-th pronunciation period Tx_n is specified as the prosody Px_n. The prosody change index Dx_n is calculated as the time length corresponding to the difference between the prosody Px_n and the prosody Px_n-1. The prosody change index Dx_n-1 is calculated by the same procedure, from the difference between the prosody Px_n-1 calculated for the pronunciation period Tx_n-1 of the utterance voice Vx and the prosody Px_n-2 calculated for the immediately preceding pronunciation period Tx_n-2.
The response generation unit 36 generates the response signal Y so that the prosody Py_n (that is, the utterance length) of the response voice Vy to the utterance voice Vx of the pronunciation period Tx_n becomes the time length (change amount Dy_n) corresponding to the prosody change index Dx_n. That is, the change amount Dy_n is applied as the prosody Py_n of the response voice Vy. For example, the sum of the prosody change index Dx_n and an initial value may be calculated as the change amount Dy_n. As understood from the above description, even in the configuration in which the prosody Px and the prosody Py are utterance lengths, a response signal Y representing a response voice Vy of the prosody Py corresponding to the change of the prosody Px of the utterance voice Vx (the prosody change index Dx_n) is generated. At the beginning of the voice dialogue, at the stage where the difference of the prosody Px cannot yet be calculated between two successive pronunciation periods Tx, the change amount Dy is set to a predetermined initial value.
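The sketch promised above follows: one hypothetical way to compute several of the listed prosody types from frame-wise pitch and volume tracks, plus the onset scheduling of example (7). The inputs (f0, rms, hop_sec, n_phonemes) are assumed to come from an analysis front end; example (4), spectrum width, is omitted because it needs the spectral envelope rather than these tracks.

    import numpy as np

    def prosody_features(f0, rms, hop_sec, n_phonemes):
        # f0 and rms are frame-wise pitch and volume tracks of one pronunciation
        # period; hop_sec is the frame hop in seconds; n_phonemes is assumed to
        # be supplied by a separate front end
        dur = len(f0) * hop_sec                     # (8) utterance length
        return {
            "pitch":        float(np.mean(f0)),    # (1)
            "volume":       float(np.mean(rms)),   # (2)
            "speech_rate":  n_phonemes / dur,      # (3) phonemes per second
            "pitch_range":  float(np.ptp(f0)),     # (5) max - min pitch
            "volume_range": float(np.ptp(rms)),    # (6) max - min volume
            "length":       dur,                   # (8)
        }

    def schedule_response_onset(tx_end, px_curr, px_prev, initial_gap=0.6):
        # example (7): the gap before the response period Ty_n is the change
        # amount Dy_n, derived here from the change of the preceding intervals;
        # the clamp to zero and the initial gap value are assumptions
        dy = initial_gap if px_prev is None else px_curr - px_prev
        return tx_end + max(0.0, dy)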
<Second Embodiment>
A second embodiment of the present invention is described below. For elements whose operation or function is the same as in the first embodiment, the reference signs used in the description of the first embodiment are reused and detailed description of each is omitted as appropriate.
The response generation unit 36 of the first embodiment generates the response signal Y representing the response voice Vy of the prosody Py corresponding to the temporal change of the prosody Px of the utterance voice Vx. The response generation unit 36 of the second embodiment generates a response signal Y representing a response voice Vy of the prosody Py corresponding to the numerical value of the prosody Px of the utterance voice Vx. That is, whereas in the first embodiment the prosody Py of the response voice Vy is controlled according to a relative value of the prosody Px (that is, the prosody change index Dx), in the second embodiment the prosody Py of the response voice Vy is controlled according to a single value of the prosody Px. In the second embodiment, as in the first embodiment, the response generation unit 36 generates the response signal Y by adjusting the prosody Pz of the audio signal Z stored in the storage device 22 to the prosody Py, and the prosody Py is a feature quantity of the same kind as the prosody Px with a different numerical value.
The specific examples of the prosody Px and the prosody Py in the second embodiment are the same as in the first embodiment; for example, pitch, volume, speech rate, spectrum width, pitch range, volume range, utterance interval and utterance length are preferred examples of the prosody Px and the prosody Py. An index value indicating the tendency of temporal change of a prosody such as pitch or volume (for example, a rate of change such as a rate of increase or decrease) may also be adopted as the prosody Px and the prosody Py.
FIG. 7 is a flowchart of the processing executed by the control device 20 of the second embodiment. The processing of FIG. 7 is started, for example, in response to an instruction from the user U to the voice dialogue apparatus 100 (for example, an instruction to start the voice dialogue program). When the processing of FIG. 7 starts, the voice analysis unit 34 analyzes the utterance signal X generated by the voice input device 24, thereby specifying the prosody Px for one pronunciation period of the utterance voice Vx (Sb1).
The response generation unit 36 generates a response signal Y of the prosody Py corresponding to the prosody Px (Sb2). Specifically, the response generation unit 36 generates the response signal Y representing the response voice Vy of the prosody Py by changing the prosody Pz of the audio signal Z to the prosody Py. The response generation unit 36 then plays back the response voice Vy by supplying the response signal Y generated by the above processing to the playback device 26 (Sb3).
When playback of the response voice Vy is completed, the control device 20 determines whether the user U has instructed the end of the voice dialogue (Sb4). If the end of the voice dialogue has not been instructed (Sb4: NO), the processing returns to step Sb1. That is, the specification of the prosody Px of the utterance voice Vx (Sb1), the generation of the response signal Y of the prosody Py corresponding to the prosody Px (Sb2), and the playback of the response voice Vy (Sb3) are repeated for every pronunciation period Tx of the utterance voice Vx. As in the first embodiment, a voice dialogue is therefore realized in which the pronunciation of an arbitrary utterance voice Vx by the user U and the playback of the response voice Vy to that utterance voice Vx alternate repeatedly.
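As a sketch of the second embodiment's mapping (the disclosure only requires that Py depend on the value of Px itself; the affine pull toward Px below, and its weight, are hypothetical concrete choices):

    def prosody_from_value(px, pz, weight=0.5):
        # move the stored prosody Pz toward the observed value Px; weight = 0
        # leaves Z unchanged, weight = 1 copies the utterance prosody outright
        return pz + weight * (px - pz)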
As described above, in the second embodiment, the response signal Y representing the response voice Vy of the prosody Py corresponding to the prosody Px of the utterance voice Vx is generated. It is therefore possible to realize a natural voice dialogue simulating the tendency of real dialogue, in which the prosody of the dialogue partner's response voice tracks changes in the prosody of the uttered voice.
<Modifications>
Specific modifications added to each of the aspects exemplified above are described below. Two or more aspects arbitrarily selected from the following examples may be combined as appropriate insofar as they do not contradict each other.
(1) In each of the embodiments described above, the response signal Y representing the response voice Vy of the prosody Py is generated by changing the prosody Pz of the audio signal Z by the change amount Dy_n corresponding to the prosody change index Dx_n, but the prosody Py_n of the current response voice Vy may instead be set according to the prosody change index Dx_n and the prosody Py_n-1 of the immediately preceding response voice Vy. Specifically, the response generation unit 36 sets, as the prosody Py_n, a value obtained by changing the prosody Py_n-1 according to the prosody change index Dx_n; for example, the value obtained by adding the prosody change index Dx_n to the prosody Py_n-1 is set as the prosody Py_n. With this configuration as well, a response signal Y representing a response voice Vy of the prosody Py corresponding to the change of the prosody Px of the utterance voice Vx can be generated.
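This recursive variant reduces to a one-line update; the sketch below simply restates the additive example from the text:

    def next_response_prosody(py_prev, dx):
        # Py_n = Py_{n-1} + Dx_n: shift the previous response prosody by the
        # utterance's change index, instead of shifting the stored prosody Pz
        return py_prev + dx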
(2) In each of the embodiments described above, a prosody Py of the response voice Vy of the same kind is controlled according to the prosody Px of the utterance voice Vx, but the prosody Px of the utterance voice Vx and the prosody Py of the response voice Vy controlled according to that prosody Px may be feature quantities of mutually different kinds. For example, the volume (prosody Py) of the response voice Vy may be controlled according to changes of the pitch (prosody Px) of the utterance voice Vx.
(3) In each of the embodiments described above, one prosody Py of the response voice Vy is controlled according to the prosody Px of the utterance voice Vx, but plural kinds of prosody Py of the response voice Vy may be controlled according to one kind of prosody Px of the utterance voice Vx. For example, two or more prosodies Py arbitrarily selected from pitch, volume, speech rate, spectrum width, pitch range, volume range, utterance interval and utterance length are controlled according to one kind of prosody Px of the utterance voice Vx. The combination (kinds and total number) of prosodies Py of the response voice Vy controlled according to the prosody Px is arbitrary.
The prosody Py of the response voice Vy may also be controlled according to plural kinds of prosody Px of the utterance voice Vx. For example, two or more prosodies Px arbitrarily selected from pitch, volume, speech rate, spectrum width, pitch range, volume range, utterance interval and utterance length are specified from the utterance voice Vx and used to control one kind of prosody Py of the response voice Vy. Plural kinds of prosody Py may also be controlled according to plural kinds of prosody Px. As understood from the above description, the combination (kinds and total number) of prosodies Px of the utterance voice Vx applied to the control of the prosody Py of the response voice Vy is arbitrary.
(4) In each of the embodiments described above, the prosody Py of the response voice Vy is controlled according to the prosody Px of the utterance voice Vx, but elements other than the prosody Px of the utterance voice Vx may also be applied to the control of the prosody Py of the response voice Vy. For example, the prosody Py of the response voice Vy may be controlled according to the prosody Px of the utterance voice Vx and a correction value (offset) set independently of the prosody Px; for example, the final prosody Py is calculated by adding the correction value to a provisional value set according to the prosody Px. The correction value may be either a fixed value or a variable value; for example, the correction value may be decreased as the duration of the voice dialogue using the voice dialogue apparatus 100 grows longer.
(5) The prosody Py of the response voice Vy may be limited to a predetermined range. For example, when the provisional prosody value calculated according to the prosody Px of the utterance voice Vx exceeds (or falls below) a predetermined threshold, that threshold is adopted as the prosody Py. This configuration reduces the possibility that the prosody Py of the response voice Vy takes an abnormal value and the voice dialogue becomes unnatural. Alternatively, when the provisional prosody value calculated according to the prosody Px of the utterance voice Vx exceeds (or falls below) a predetermined threshold, a response voice Vy representing a request to repeat the utterance (asking back) may be generated.
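A sketch of this limiting behavior, assuming the bounds are given; reporting which bound was hit lets the caller substitute the asking-back response instead of the clipped value, as the modification allows:

    def limit_prosody(provisional, lo, hi):
        # clip the provisional value into [lo, hi] and report the crossed bound
        if provisional > hi:
            return hi, "upper"
        if provisional < lo:
            return lo, "lower"
        return provisional, None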
(6) In the first embodiment, the difference between the prosody Px_n of the pronunciation period Tx_n of the utterance voice Vx and the prosody Px_n-1 of the immediately preceding pronunciation period Tx_n-1 is calculated as the prosody change index Dx_n, but the reference value for the change of the prosody Px_n is not limited to the prosody Px_n-1 of the immediately preceding pronunciation period Tx_n-1. For example, the change of the prosody Px_n relative to the prosody Px of a pronunciation period Tx other than the immediately preceding pronunciation period Tx_n-1 (for example, a pronunciation period Tx two or more periods earlier) may be calculated as the prosody change index Dx_n. The prosody change index Dx_n may also be calculated according to changes of the prosody Px over three or more pronunciation periods Tx; for example, it may be calculated according to the change of the current prosody Px_n relative to a representative value (for example, the average) of the prosody Px over a plurality of past pronunciation periods Tx.
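A sketch of the averaged-baseline variant, with the window size k as an assumed parameter:

    def change_index_vs_history(history, px_curr, k=3):
        # baseline = mean of the prosody over the last k pronunciation periods
        if not history:
            return 0.0      # no reference yet; the caller may use an initial value
        recent = history[-k:]
        return px_curr - sum(recent) / len(recent)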
(7) In the first embodiment, the difference between the prosody Px_n and the prosody Px_n-1 of the utterance voice Vx is calculated as the prosody change index Dx_n, but the method of calculating the prosody change index Dx_n is not limited to this example. For instance, the ratio of the prosody Px_n to the prosody Px_n-1 may be calculated as the prosody change index Dx_n (Dx_n = Px_n / Px_n-1). That is, the prosody change index Dx_n is comprehensively expressed as an index corresponding to the change of the prosody Px of the utterance voice Vx.
(8) In each of the embodiments described above, the prosody Py of the response voice Vy is set according to the difference (prosody change index Dx_n) between the prosody Px_n of the pronunciation period Tx_n of the utterance voice Vx and the prosody Px_n-1 of the immediately preceding pronunciation period Tx_n-1, but the variables reflected in the prosody Py of the response voice Vy are not limited to the prosody change index Dx_n. For example, the prosody Py_n of the current response voice Vy may be set according to the prosody change index Dx_n and the prosody Py_n-1 of the immediately preceding response voice Vy. The prosody difference between plural past response voices Vy (Py_n-2 - Py_n-1) may also be applied, together with the prosody change index Dx_n, to the setting of the prosody Py_n of the response voice Vy.
(9) In each of the embodiments described above, the response signal Y is generated and played back from the audio signal Z stored in the storage device 22, but the response signal Y representing a response voice Vy of specific utterance content may also be synthesized by, for example, a known speech synthesis technique. For the synthesis of the response signal Y, for example, unit-concatenation speech synthesis, or speech synthesis using a statistical model such as a hidden Markov model, is suitably used. The utterance voice Vx and the response voice Vy are not limited to human vocal sounds; for example, animal calls may be used as the utterance voice Vx and the response voice Vy.
(10) In each of the embodiments described above, the voice dialogue apparatus 100 comprises the voice input device 24 and the playback device 26, but the voice input device 24 and the playback device 26 may instead be installed in a device separate from the voice dialogue apparatus 100 (a voice input/output device). The voice dialogue apparatus 100 is realized by, for example, a terminal device such as a mobile phone or a smartphone, and the voice input/output device is realized by, for example, an electronic device such as an animal-shaped toy or a robot. The voice dialogue apparatus 100 and the voice input/output device can communicate wirelessly or by wire: the utterance signal X generated by the voice input device 24 of the voice input/output device is transmitted wirelessly or by wire to the voice dialogue apparatus 100, and the response signal Y generated by the voice dialogue apparatus 100 is transmitted wirelessly or by wire to the playback device 26 of the voice input/output device.
(11) In each of the embodiments described above, the voice dialogue apparatus 100 is realized by an information processing device such as a mobile phone or a personal computer, but part or all of the functions of the voice dialogue apparatus 100 may also be realized by a server device (a so-called cloud server). Specifically, the voice dialogue apparatus 100 is realized by a server device that communicates with a terminal device via a communication network such as a mobile communication network or the Internet. For example, the voice dialogue apparatus 100 receives from the terminal device the utterance signal X generated by the voice input device 24 of that terminal device, generates the response signal Y from the utterance signal X by the configuration of each of the embodiments described above, transmits the response signal Y to the terminal device, and causes the playback device 26 of the terminal device to play back the response voice Vy. The voice dialogue apparatus 100 is realized by a single device or by a set of plural devices (that is, a server system), and how the functions realized by the voice dialogue apparatus 100 are divided between the server device and the terminal device is arbitrary.
(12) In each of the embodiments described above, a response voice Vy of specific utterance content (for example, a backchannel such as "uh-huh") is played back in response to the utterance voice Vx, but the utterance content of the response voice Vy is not limited to this example. For instance, the utterance content of the utterance voice Vx may be analyzed by speech recognition and morphological analysis of the utterance signal X, and a response voice Vy of content appropriate to that utterance content may be selected from plural candidates, or synthesized, and played back by the playback device 26. In a configuration that does not execute speech recognition and morphological analysis, a response voice Vy of utterance content prepared in advance, independent of the utterance voice Vx, is played back. On a simple view, one might suppose that no natural dialogue could be established this way; in practice, however, because the prosody of the response voice Vy is controlled in diverse ways as exemplified in the embodiments described above, the user U can perceive something like a natural dialogue between humans. A configuration that does not execute speech recognition and morphological analysis also has the advantage that the processing delay and processing load caused by those processes are reduced or eliminated.
(13) In each of the embodiments described above, the response signal Y of the response voice Vy is generated by adjusting the prosody Pz of the audio signal Z, but the method of generating the response signal Y is not limited to this example. For instance, a plurality of audio signals Z with mutually different prosodies Pz may be stored in the storage device 22, and the audio signal Z whose prosody Pz is closest to the prosody value corresponding to the prosody change index Dx (hereinafter "target value") may be selected from the plurality of audio signals Z as the response signal Y. That is, the process of selecting the response signal Y from plural candidates (audio signals Z) is one example of the process of generating the response signal Y. The response signal Y may also be generated from two or more audio signals Z selected from the plurality of audio signals Z in order of how close their prosodies Pz are to the target value; for example, the response signal Y is generated by a weighted sum of, or interpolation between, two or more audio signals Z.
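A sketch of this selection/blending variant, assuming the candidates are (Pz, signal) pairs; the inverse-distance weighting and the naive length alignment are assumptions, since the text does not fix how the weighted sum is formed:

    import numpy as np

    def select_response(candidates, target, blend=False):
        # candidates: list of (pz, signal) pairs; target: prosody value derived
        # from the change index Dx
        ranked = sorted(candidates, key=lambda c: abs(c[0] - target))
        if not blend or len(ranked) < 2:
            return ranked[0][1]                        # nearest candidate only
        (p1, z1), (p2, z2) = ranked[:2]
        d1, d2 = abs(p1 - target), abs(p2 - target)
        w = 0.5 if d1 + d2 == 0 else d2 / (d1 + d2)    # closer candidate weighs more
        n = min(len(z1), len(z2))                      # naive length alignment
        return w * np.asarray(z1[:n]) + (1 - w) * np.asarray(z2[:n])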
(14) The voice dialogue apparatus 100 exemplified in each of the embodiments described above can also be used for the evaluation of actual dialogue between humans. For example, the prosody of a response voice observed in an actual dialogue between humans (hereinafter "observed voice") is compared with the prosody of the response voice Vy generated in the manner described above; when the two prosodies are similar, the observed voice can be evaluated as appropriate, whereas when the prosodies diverge, the observed voice can be evaluated as inappropriate. A device executing the evaluation exemplified above (a dialogue evaluation device) may also be used for training dialogue between humans.
(15) The voice dialogue apparatus 100 exemplified in each of the embodiments described above is, as noted, realized by the cooperation of the control device 20 and a program for voice dialogue.
A program according to a first aspect of the present invention (for example, the first embodiment) causes a computer to execute voice analysis processing (Sa1) of specifying a feature quantity of a first voice represented by a first audio signal for each pronunciation period, and response generation processing (Sa2 and Sa3) of generating a second audio signal representing a second voice of a feature quantity corresponding to the change of the feature quantity of the first voice over a plurality of pronunciation periods. A program according to a second aspect of the present invention (for example, the second embodiment) causes a computer to execute voice analysis processing (Sb1) of specifying a feature quantity of a first voice represented by a first audio signal, and response generation processing (Sb2) of generating a second audio signal representing a second voice of a feature quantity corresponding to the feature quantity of the first voice.
The program according to each of the above aspects is provided in a form stored in a computer-readable recording medium and installed in a computer. The recording medium is, for example, a non-transitory recording medium, a good example of which is an optical recording medium (optical disc) such as a CD-ROM, but it may include any known form of recording medium such as a semiconductor recording medium or a magnetic recording medium. Note that "non-transitory recording medium" includes every computer-readable recording medium except a transitory, propagating signal, and does not exclude volatile recording media. The program may also be delivered to a computer in the form of distribution via a communication network.
(16) From the embodiments exemplified above, for example, the following configurations are grasped.
<Aspect 1>
A voice processing method according to a preferred aspect of the present invention (aspect 1) is realized by a computer specifying a feature quantity of a first voice represented by a first audio signal for each pronunciation period, and generating a second audio signal representing a second voice of a feature quantity corresponding to the change of the feature quantity of the first voice over a plurality of pronunciation periods. In this aspect, a second audio signal representing a second voice of a feature quantity corresponding to the change of the feature quantity of the first voice is generated. It is therefore possible to realize a natural voice dialogue simulating the tendency of real dialogue in which, for example, the feature quantity of the dialogue partner's response voice tracks changes in the feature quantity of the uttered voice.
<Aspect 2>
A voice processing apparatus according to a preferred aspect of the present invention (aspect 2) comprises a voice analysis unit that specifies a feature quantity of a first voice represented by a first audio signal for each pronunciation period, and a response generation unit that generates a second audio signal representing a second voice of a feature quantity corresponding to the change of the feature quantity of the first voice over a plurality of pronunciation periods. In this aspect, a second audio signal representing a second voice of a feature quantity corresponding to the change of the feature quantity of the first voice is generated. It is therefore possible to realize a natural voice dialogue simulating the tendency of real dialogue in which, for example, the feature quantity of the dialogue partner's response voice tracks changes in the feature quantity of the uttered voice.
<Other aspects>
In a preferred example of aspect 1 or aspect 2, the feature quantity of the first voice, and the feature quantity of the second voice corresponding to the change of that feature quantity, include at least one of: pitch, volume, speech rate, spectrum width (the amount of variation of the spectral envelope), the fluctuation range of pitch within a pronunciation period, the fluctuation range of volume within a pronunciation period, the interval between successive pronunciation periods, and the time length of a pronunciation period.
100: voice dialogue apparatus, 20: control device, 22: storage device, 24: voice input device, 242: sound collection device, 244: A/D converter, 26: playback device, 262: D/A converter, 264: sound emitting device, 32: voice acquisition unit, 34: voice analysis unit, 36: response generation unit.

Claims (10)

1.  A voice processing method realized by a computer, the method comprising:
     specifying a feature quantity of a first voice represented by a first audio signal for each pronunciation period; and
     generating a second audio signal representing a second voice of a feature quantity corresponding to a change of the feature quantity of the first voice over a plurality of pronunciation periods.
2.  The voice processing method according to claim 1, wherein the feature quantity of the second voice is a pitch.
3.  The voice processing method according to claim 1 or claim 2, wherein the feature quantity of the second voice is a volume.
4.  The voice processing method according to any one of claims 1 to 3, wherein the feature quantity of the second voice is a speech rate.
5.  The voice processing method according to any one of claims 1 to 4, wherein the feature quantity of the second voice is a spectrum width, which is an amount of variation of a spectral envelope.
6.  The voice processing method according to any one of claims 1 to 5, wherein the feature quantity of the second voice is a fluctuation range of pitch within a pronunciation period.
7.  The voice processing method according to any one of claims 1 to 6, wherein the feature quantity of the second voice is a fluctuation range of volume within a pronunciation period.
8.  The voice processing method according to any one of claims 1 to 7, wherein the feature quantity of the second voice is an interval between successive pronunciation periods.
9.  The voice processing method according to any one of claims 1 to 8, wherein the feature quantity of the second voice is a time length of a pronunciation period.
10.  A voice processing apparatus comprising:
     a voice analysis unit that specifies a feature quantity of a first voice represented by a first audio signal for each pronunciation period; and
     a response generation unit that generates a second audio signal representing a second voice of a feature quantity corresponding to a change of the feature quantity of the first voice over a plurality of pronunciation periods.
PCT/JP2018/034010 2017-09-25 2018-09-13 Speech processing method and speech processing device WO2019059094A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017-183546 2017-09-25
JP2017183546A JP2019060941A (en) 2017-09-25 2017-09-25 Voice processing method

Publications (1)

Publication Number Publication Date
WO2019059094A1 (en)

Family

ID=65810887

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/034010 WO2019059094A1 (en) 2017-09-25 2018-09-13 Speech processing method and speech processing device

Country Status (2)

Country Link
JP (1) JP2019060941A (en)
WO (1) WO2019059094A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0247700A (en) * 1988-08-10 1990-02-16 Nippon Hoso Kyokai <Nhk> Speech synthesizing method
JP2004086001A (en) * 2002-08-28 2004-03-18 Sony Corp Conversation processing system, conversation processing method, and computer program
US20050261905A1 (en) * 2004-05-21 2005-11-24 Samsung Electronics Co., Ltd. Method and apparatus for generating dialog prosody structure, and speech synthesis method and system employing the same
JP2006208460A (en) * 2005-01-25 2006-08-10 Honda Motor Co Ltd Equipment controller of voice recognition type and vehicle
JP2015069038A (en) * 2013-09-30 2015-04-13 ヤマハ株式会社 Voice synthesizer and program
JP2017106990A (en) * 2015-12-07 2017-06-15 ヤマハ株式会社 Voice interactive device and program


Also Published As

Publication number Publication date
JP2019060941A (en) 2019-04-18


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18858032

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18858032

Country of ref document: EP

Kind code of ref document: A1