
WO2019059094A1 - Speech processing method and speech processing device - Google Patents

Speech processing method and speech processing device

Info

Publication number
WO2019059094A1
WO2019059094A1 (PCT/JP2018/034010)
Authority
WO
WIPO (PCT)
Prior art keywords
prosody
voice
speech
response
change
Prior art date
Application number
PCT/JP2018/034010
Other languages
French (fr)
Japanese (ja)
Inventor
嘉山 啓
Original Assignee
Yamaha Corporation (ヤマハ株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yamaha Corporation
Publication of WO2019059094A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 — Speech synthesis; Text to speech systems
    • G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 — Prosody rules derived from text; Stress or intonation
    • G10L15/00 — Speech recognition
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • the present invention relates to a technique suitable for speech interaction.
  • Patent Document 1 discloses a technique for analyzing the content of an utterance by speech recognition of the user's utterance voice, and synthesizing and reproducing a response voice according to the analysis result.
  • In view of such circumstances, the present invention aims to realize a natural voice dialogue.
  • A voice processing method according to a preferred aspect specifies the feature amount of a first voice represented by a first voice signal for each pronunciation period, and generates a second voice signal representing a second voice whose feature amount corresponds to the change of the feature amount of the first voice over a plurality of pronunciation periods.
  • A voice processing device according to a preferred aspect includes a voice analysis unit that specifies the feature amount of the first voice represented by the first voice signal for each pronunciation period, and a response generation unit that generates a second voice signal representing a second voice whose feature amount corresponds to the change of the feature amount of the first voice over a plurality of pronunciation periods.
  • FIG. 1 is a block diagram of the voice interactive apparatus 100 according to the first embodiment of the present invention.
  • the voice interactive apparatus 100 according to the first embodiment is a computer system that reproduces a voice (hereinafter referred to as “response voice”) Vy that responds to an input voice (hereinafter referred to as “speech voice”) Vx pronounced by the user U.
  • a portable information processing apparatus such as a cellular phone or a smartphone, or an information processing apparatus such as a personal computer is used as the voice interaction apparatus 100.
  • The speech voice Vx is, for example, the voice of an utterance including a question or a remark, and the response voice Vy is the voice of a response including an answer to the question or an acknowledgment of the remark.
  • The response voice Vy also includes, for example, voices representing interjections. An interjection is an uninflected independent word (exclamation) used independently of other segments. Specific examples include backchannel words such as "un" and "ee" ("aha" or "right" in English), hesitation fillers such as "eto" and "ano" ("um" or "er" in English), answers (affirmation or negation of a question) such as "hai" and "iie" ("yes" or "no" in English), expressions of the speaker's emotion such as "aa" and "oo" ("ah" or "woo" in English), and words meaning a question back (re-asking) such as "e?" and "nani?" ("pardon?" or "sorry?" in English).
  • the voice interaction device 100 is a voice processing device that generates a response voice Vy (example of a second voice) of a feature amount according to the feature amount of the speech voice Vx (example of a first voice).
  • The feature amount is, for example, prosody.
  • Prosody is a linguistic and phonetic characteristic that a listener of the voice can perceive but that cannot be grasped from the general written representation of the language alone (for example, a representation excluding special notation for prosody).
  • Prosody can also be described as a characteristic that allows the listener to recall or infer the speaker's intention or emotion.
  • Various features are included in the concept of prosody: intonation (change or inflection of the tone of the voice), tone (the height or intensity of the voice), sound length (utterance length), speech speed, rhythm (the structure of temporal change in tone), and accent (pitch or stress accent).
  • Typical examples of prosody are pitch (fundamental frequency) and volume.
  • the voice interaction device 100 includes a control device 20, a storage device 22, a voice input device 24, and a reproduction device 26.
  • The voice input device 24 is an element that generates a voice signal X (hereinafter "speech signal") representing the speech voice Vx of the user U, and includes a sound collection device 242 and an A/D converter 244.
  • the sound collection device 242 picks up the speech voice Vx (an example of the first speech signal) uttered by the user U and generates an analog speech signal representing the sound pressure fluctuation of the speech voice Vx.
  • the A / D converter 244 converts the audio signal generated by the sound collection device 242 into a digital speech signal X.
  • the control device 20 is an arithmetic processing unit (for example, a CPU) that comprehensively controls each element of the voice interaction device 100.
  • the control device 20 acquires the speech signal X supplied from the speech input device 24, and generates a response signal Y (exemplary second speech signal) representing a response speech Vy to the speech speech Vx.
  • the reproduction device 26 is an element that reproduces the response voice Vy according to the response signal Y generated by the control device 20, and includes a D / A converter 262 and a sound emission device 264.
  • The D/A converter 262 converts the digital response signal Y generated by the control device 20 into an analog voice signal, and the sound emitting device 264 (for example, a speaker or headphones) emits the response voice Vy as sound waves according to the converted voice signal.
  • the reproduction device 26 also includes a processing circuit such as an amplifier for amplifying the response signal Y.
  • the speech signal X and the response signal Y are, for example, audio data in wav format.
  • the storage device 22 stores a program executed by the control device 20 and various data used by the control device 20.
  • a known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of a plurality of recording mediums is arbitrarily adopted as the storage device 22.
  • the storage device 22 according to the first embodiment stores an audio signal Z representing a response voice of specific utterance content.
  • The following description exemplifies a case where a voice signal Z of a response voice such as "un", which represents a backchannel and is an example of an interjection, is stored in the storage device 22.
  • the audio signal Z is recorded in advance and stored in the storage device 22 as audio data of an arbitrary format such as wav format.
  • the control device 20 implements a plurality of functions (the voice analysis unit 34 and the response generation unit 36) for establishing a voice dialogue with the user U by executing the program stored in the storage device 22.
  • a configuration in which the functions of control device 20 are realized by a plurality of devices (i.e., systems) or a configuration in which a dedicated electronic circuit realizes a part of the functions of control device 20 may be employed.
  • the speech analysis unit 34 specifies the prosody Px of the speech voice Vx from the speech signal X generated by the speech input device 24.
  • the prosody Px is an acoustic feature that can be extracted from the speech signal X.
  • the voice analysis unit 34 of the first embodiment sequentially specifies the prosody Px for each pronunciation period of the speech voice Vx.
  • the speech analysis unit 34 specifies the numerical value of the specific type of prosody Px required by the program being executed among the plurality of types.
  • One pronunciation period is a continuous period that is grasped as one utterance by the user U (for example, a question or a remark), such as a period during which the volume of the speech voice Vx continuously exceeds a predetermined threshold.
  • A speech period corresponding to one response may also be defined as a pronunciation period.
  • the speech analysis unit 34 specifies a representative value (for example, an average value) of a plurality of prosody specified at a predetermined cycle in the sound generation period as the prosody Px of the sound generation period.
  • the prosody at a specific time (for example, an end point) within the pronunciation period may be specified as the prosody Px of the pronunciation period.
  • the prosody Px may be specified from the point in time immediately before the last phoneme of the utterance voice Vx during the pronunciation period.
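As an illustration of this per-period analysis, the following is a minimal sketch in Python, assuming a NumPy-based pipeline; the frame length, hop size, and volume threshold are illustrative values, not taken from the patent.

```python
import numpy as np

def frame_rms(signal, frame_len=1024, hop=512):
    """Per-frame RMS volume of a mono signal (used to find pronunciation periods)."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

def pronunciation_periods(signal, threshold=0.02, frame_len=1024, hop=512):
    """Spans (start_frame, end_frame) where the volume stays above the threshold."""
    voiced = frame_rms(signal, frame_len, hop) > threshold
    periods, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                      # a pronunciation period begins
        elif not v and start is not None:
            periods.append((start, i))     # the period ends when volume drops
            start = None
    if start is not None:
        periods.append((start, len(voiced)))
    return periods

def representative_prosody(per_frame_values):
    """Representative value (here the mean) of the per-frame prosody in one period."""
    return float(np.mean(per_frame_values))
```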
  • the response generator 36 generates a response signal Y representing the response voice Vy. Specifically, the response generation unit 36 generates a response signal Y representing the response speech Vy of the prosody Py corresponding to the temporal change of the prosody Px specified by the speech analysis unit 34.
  • The change in prosody Px is an example of the "change in feature amount". As described above, the prosody Px is specified for each pronunciation period, so the temporal change of the prosody Px means the change of the prosody Px between successive pronunciation periods, not the change of prosody within one pronunciation period.
  • the prosody Py is a feature of the same type as the prosody Px, but the numerical values are different.
  • the response generation unit 36 of the first embodiment generates a response signal Y by adjusting the prosody Pz of the audio signal Z stored in the storage device 22 to the prosody Py.
  • The response signal Y generated by the response generation unit 36 is supplied to the reproduction device 26, whereby the response voice Vy is reproduced. That is, a response voice Vy obtained by adjusting the initial response voice represented by the voice signal Z according to the prosody Px of the speech voice Vx is reproduced from the reproduction device 26.
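As one concrete reading of this adjustment, the sketch below takes volume as the prosody: the stored response signal Z with prosody Pz is rescaled so that its RMS volume becomes the target prosody Py. This is a hedged example; the patent does not prescribe a specific signal-processing method.

```python
import numpy as np

def adjust_volume(z, target_py):
    """Rescale the stored response signal z so its RMS volume (Pz) becomes Py."""
    pz = np.sqrt(np.mean(z ** 2))      # current prosody Pz: RMS volume of Z
    return z * (target_py / pz) if pz > 0 else z
```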
  • FIG. 2 is a flowchart of processing executed by the control device 20 according to the first embodiment.
  • the process of FIG. 2 is started in response to an instruction from the user U (for example, an instruction to start a program for speech interaction) to the speech interaction apparatus 100.
  • the speech analysis unit 34 analyzes the speech signal X generated by the speech input device 24 to specify the prosody Px for one pronunciation period Tx of the speech speech Vx (Sa1).
  • Although the prosody Px basically has its value determined at the end of the pronunciation period Tx, the value may also be determined at a point partway through the pronunciation period Tx.
  • FIG. 3 illustrates the prosody Px_n calculated for the n-th pronunciation period Tx_n of the utterance voice Vx (n is a natural number). That is, FIG. 3 is an explanatory view of a process performed when the utterance (for example, an inquiry or a talk) of the sound generation period Tx_n by the user U is completed.
  • A prosody change index Dx_n is calculated as the difference between the prosody Px_n of the latest pronunciation period Tx_n and the prosody Px_n−1 of the immediately preceding pronunciation period Tx_n−1 (Dx_n = Px_n − Px_n−1) (Sa2), and the response generation unit 36 then generates a response signal Y of the prosody Py corresponding to the prosody change index Dx_n (Sa3). Specifically, as illustrated in FIG. 3, the response generation unit 36 generates the response signal Y representing the response voice Vy of the prosody Py by changing the prosody Pz of the voice signal Z by the change amount Dy_n corresponding to the prosody change index Dx_n. Note that at the stage when the speech voice Vx of the first pronunciation period Tx_1 has been uttered, the difference in prosody Px cannot yet be calculated between two successive pronunciation periods Tx, so the change amount Dy_1 is set to a predetermined initial value.
  • The prosody change index Dx_n−1 is calculated in the same way, as the difference between the prosody Px_n−1 calculated for the pronunciation period Tx_n−1 of the speech voice Vx and the prosody Px_n−2 calculated for the immediately preceding pronunciation period Tx_n−2.
  • FIG. 4 is a graph showing the relationship between the prosody change index Dx and the change amount Dy (the difference between the prosody Pz and the prosody Py).
  • the graph of FIG. 4 corresponds to a rule for determining the change amount Dy from the prosody change index Dx.
  • the change amount Dy is determined so that the change amount Dy linearly increases with respect to the increase of the prosody change index Dx.
  • the change amount Dy is set to a numerical value equal to the prosody change index Dx.
  • When the prosody Px_n exceeds the prosody Px_n−1 (that is, when the prosody Px of the speech voice Vx has increased), the prosody Py of the response voice Vy is set to a value exceeding the prosody Pz of the voice signal Z. Conversely, when the prosody Px_n is less than the prosody Px_n−1 (that is, when the prosody Px of the speech voice Vx has decreased), the prosody Py of the response voice Vy is set to a value less than the prosody Pz of the voice signal Z.
  • The relationship between the prosody change index Dx and the change amount Dy is not limited to the above example. For example, as illustrated by the broken line in FIG. 4, the change amount Dy may vary non-linearly with respect to the prosody change index Dx, and the sum of the prosody change index Dx_n and an initial value may be calculated as the change amount Dy_n. That is, the relationship between the prosody change index Dx and the change amount Dy may be any relationship under which the prosody Py of the response voice Vy is a prosody suited to the prosody Px of the speech voice Vx (see the sketch below).
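A sketch of the FIG. 4 mapping from the prosody change index Dx to the change amount Dy. The identity rule (Dy = Dx) corresponds to the solid line; the gain and offset parameters stand in for the non-linear and initial-value variants and are assumptions, not values from the patent.

```python
def change_amount(dx, gain=1.0, offset=0.0):
    """Determine Dy from Dx; gain=1 and offset=0 gives the linear rule Dy = Dx."""
    return gain * dx + offset

# Example: the prosody rose by 12 units between two successive utterances.
dy_linear = change_amount(12.0)               # Dy = 12.0 (solid line in FIG. 4)
dy_offset = change_amount(12.0, offset=3.0)   # Dx plus an initial value
```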
  • In short, a change amount Dy is set that represents the degree to which the prosody Py of the response voice Vy is changed. That is, the change amount Dy for adjusting the prosody Py of the response voice Vy to be output immediately afterwards is set from the prosody change index Dx, which indicates the change of the prosody Px between successive utterances of the speech voice Vx.
  • The prosody Py of the response voice Vy set by the above method is the result of adjusting the prosody Pz of the voice signal Z so as to be in harmony with the utterance, such as a question or a remark.
  • the response generation unit 36 reproduces the response voice Vy by supplying the response signal Y generated by the above processing to the reproduction device 26 (Sa4).
  • the control device 20 determines whether the end of the voice dialogue has been instructed by the user U (Sa5). When the end of the voice dialogue is not instructed (Sa5: NO), the control device 20 shifts the processing to step Sa1.
  • The specification of the prosody Px of the speech voice Vx (Sa1), the calculation of the prosody change index Dx (Sa2), the generation of the response signal Y of the prosody Py according to the prosody change index Dx (Sa3), and the reproduction of the response voice Vy (Sa4) are repeated for every pronunciation period Tx of the speech voice Vx. That is, the processing from step Sa1 to step Sa4 is executed every time the user U utters the speech voice Vx (every time the speech signal X is input). Therefore, a voice dialogue is realized in which the utterance of an arbitrary speech voice Vx by the user U and the reproduction of the response voice Vy to that speech voice Vx alternate (a sketch of this loop follows below).
  • the processing from step Sa1 to step Sa4 is sequentially performed for each utterance period Tx during speech (input) by the user U, and corresponds to an operation of generating a response to one utterance voice Vx.
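A sketch of the loop of FIG. 2 (steps Sa1–Sa5), assuming that prosody values add (Py = Pz + Dy) and passing the I/O and DSP stages in as stubs; capture_utterance, analyse, synthesize, play, and end_requested are hypothetical callables standing in for the voice input device 24, the voice analysis unit 34, the response generation unit 36, and the reproduction device 26.

```python
def dialogue_loop(z, pz, analyse, synthesize, capture_utterance, play,
                  end_requested, dy_init=0.0):
    """One speech/response exchange per iteration, as in FIG. 2."""
    px_prev, dy = None, dy_init
    while not end_requested():                # Sa5: end of dialogue instructed?
        x = capture_utterance()               # speech signal X for one period Tx
        px = analyse(x)                       # Sa1: prosody Px of this period
        if px_prev is not None:
            dy = px - px_prev                 # Sa2: prosody change index Dx_n
        py = pz + dy                          # Sa3: target prosody Py
        play(synthesize(z, py))               # Sa3/Sa4: generate and reproduce Vy
        px_prev = px
```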
  • the response signal Y representing the response speech Vy of the prosody Py according to the temporal change of the prosody Px of the speech speech Vx is generated. That is, the prosody Py of the response speech Vy changes in conjunction with the prosody Px of the speech speech Vx. Therefore, it is possible to realize a natural speech dialogue that simulates the tendency of a real dialogue in which the prosody of the response voice of the dialogue partner is interlocked with the change of the prosody of the speech.
  • The first example of the prosody Px and the prosody Py is pitch (fundamental frequency).
  • When the pitch of the speech voice Vx rises over successive utterances, the pitch of the response voice Vy rises in conjunction with that rise.
  • The second example of the prosody Px and the prosody Py is volume.
  • When the volume of the speech voice Vx increases with time, the volume of the response voice Vy increases in conjunction with the increase.
  • the third example of the prosody Px and the prosody Py is speech speed.
  • Speaking speed means the speed of speech. For example, the number of phonemes included in speech within a unit time corresponds to the speech speed.
  • When the speech speed of the speech voice Vx rises, the speech speed of the response voice Vy rises in conjunction with the rise.
  • the fourth example of the prosody Px and the prosody Py is a spectrum width.
  • the spectrum width is, for example, the difference between the maximum value and the minimum value in the envelope (spectral envelope) of the frequency spectrum of speech.
  • the fifth example of the prosody Px and the prosody Py is the pitch range.
  • the pitch range is the fluctuation range of the pitch within the sound generation period (that is, the difference between the maximum value and the minimum value of the pitch within the sound generation period).
  • the sixth example of the prosody Px and the prosody Py is the volume width.
  • the sound volume width is the fluctuation range of the sound volume within the sound generation period (that is, the difference between the maximum value and the minimum value of the sound volume within the sound generation period).
  • When the volume width of the speech voice Vx increases with time, the volume width of the response voice Vy increases in conjunction with the increase.
  • the pitch range and the volume range correspond to the intonation (tone) of the sound. Therefore, in the fifth and sixth examples, the intonation of the response voice Vy changes in conjunction with the change of intonation in the speech voice Vx.
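The per-period quantities named in the third through sixth examples can be computed directly from per-frame pitch and volume tracks; a minimal sketch follows, in which the crude log-spectrum stand-in for the spectral envelope is an assumption.

```python
import numpy as np

def speech_speed(num_phonemes, duration_sec):
    """Third example: phonemes uttered per unit time."""
    return num_phonemes / duration_sec

def spectrum_width(frame):
    """Fourth example: max-min difference of the log magnitude spectrum
    (a crude stand-in for the spectral envelope)."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    log_mag = 20 * np.log10(mag + 1e-12)
    return float(np.max(log_mag) - np.min(log_mag))

def pitch_range(per_frame_pitches):
    """Fifth example: pitch fluctuation range within one pronunciation period."""
    return float(np.max(per_frame_pitches) - np.min(per_frame_pitches))

def volume_width(per_frame_volumes):
    """Sixth example: volume fluctuation range within one pronunciation period."""
    return float(np.max(per_frame_volumes) - np.min(per_frame_volumes))
```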
  • the seventh example of the prosody Px and the prosody Py is the speech interval.
  • the speech interval is an interval between two successive sounding periods in the voice dialogue (a time length from the end of the front sounding period to the start of the rear sounding period).
  • the interval between the sound generation period Tx of the speech voice Vx and the sound generation period Ty of the response sound Vy corresponds to the sound generation interval.
  • It is assumed that the pronunciation interval between the (n−2)-th pronunciation period Ty_n−2 of the response voice Vy and the (n−1)-th pronunciation period Tx_n−1 of the speech voice Vx is specified as the prosody Px_n−1, and that the pronunciation interval between the (n−1)-th pronunciation period Ty_n−1 of the response voice Vy and the n-th pronunciation period Tx_n of the speech voice Vx is specified as the prosody Px_n.
  • the prosody change index Dx_n is calculated as a time length corresponding to the difference between the prosody Px_n and the prosody Px_n-1.
  • the response generation unit 36 generates the response signal Y so that the sound generation period Ty_n of the response voice Vy starts when the change amount Dy_n corresponding to the prosody change index Dx_n elapses from the end point of the sound generation period Tx_n. That is, the variation Dy_n is applied as the prosody Py_n (pronunciation interval) of the response voice Vy.
  • the change amount Dy_n may be calculated according to the prosody change index Dx_n (that is, the difference between the prosody Px_n and the prosody Px_n-1) and a predetermined initial value. For example, the addition value of the prosody change index Dx_n and the initial value may be calculated as the change amount Dy_n.
  • In this way, a response signal Y representing the response voice Vy of the prosody Py corresponding to the change of the prosody Px of the speech voice Vx (the prosody change index Dx_n) is generated.
  • Likewise, the pronunciation interval between the earlier pronunciation periods is set according to the prosody change index Dx_n−1 determined by the same procedure. At the beginning of the voice dialogue, at the stage where the difference in prosody Px cannot yet be calculated between two successive pronunciation periods Tx, the change amount Dy is set to a predetermined initial value.
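A sketch of the seventh example, in which the pronunciation interval itself is the prosody: reproduction of the response is delayed by the change amount Dy_n after the end of the period Tx_n. time.sleep is an illustrative stand-in for a real-time scheduler.

```python
import time

def schedule_response(play, response_signal, dy_n):
    """Start reproducing Vy once Dy_n seconds have elapsed after Tx_n ends."""
    time.sleep(max(0.0, dy_n))   # wait out the pronunciation interval Py_n = Dy_n
    play(response_signal)
```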
  • the eighth example of the prosody Px and the prosody Py is the time length of the pronunciation period (hereinafter referred to as "speech length").
  • the utterance length is the time from the start to the end of the sound generation period.
  • It is assumed that the time length of the (n−1)-th pronunciation period Tx_n−1 of the speech voice Vx is specified as the prosody Px_n−1, and that the time length of the n-th pronunciation period Tx_n is specified as the prosody Px_n.
  • the prosody change index Dx_n is calculated as a time length corresponding to the difference between the prosody Px_n and the prosody Px_n-1.
  • The prosody change index Dx_n−1 is calculated in the same way, as the difference between the prosody Px_n−1 calculated for the pronunciation period Tx_n−1 of the speech voice Vx and the prosody Px_n−2 calculated for the immediately preceding pronunciation period Tx_n−2.
  • the response generation unit 36 sets the response signal Y so that the prosody Py_n (that is, the utterance length) of the response voice Vy with respect to the utterance voice Vx in the pronunciation period Tx_n becomes a time length (change amount Dy_n) according to the prosody change index Dx_n.
  • the variation Dy_n is applied as the prosody Py_n of the response speech Vy.
  • the addition value of the prosody change index Dx_n and the initial value may be calculated as the change amount Dy_n.
  • In this way, a response signal Y representing the response voice Vy of the prosody Py corresponding to the change of the prosody Px of the speech voice Vx is generated.
  • At the beginning of the voice dialogue, at the stage where the difference in prosody Px cannot yet be calculated between two successive pronunciation periods Tx, the change amount Dy is set to a predetermined initial value.
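A sketch of the eighth example, in which the utterance length is the prosody: the stored response Z is stretched to the target length Dy_n by naive linear-interpolation resampling. Note this also shifts pitch; a real implementation would more likely use a time-scale modification algorithm, which is an assumption beyond the patent text.

```python
import numpy as np

def stretch_to_length(z, sample_rate, target_sec):
    """Resample z so that its duration becomes target_sec (the prosody Py_n)."""
    n_out = max(1, int(round(target_sec * sample_rate)))
    positions = np.linspace(0, len(z) - 1, n_out)   # where to sample the original
    return np.interp(positions, np.arange(len(z)), z)
```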
  • Second Embodiment A second embodiment of the present invention will be described.
  • For elements whose operation or function is the same as in the first embodiment, the reference symbols used in the description of the first embodiment are reused, and detailed description of each is omitted.
  • the response generation unit 36 of the first embodiment generates a response signal Y representing the response speech Vy of the prosody Py according to the temporal change of the prosody Px of the speech speech Vx.
  • The response generation unit 36 of the second embodiment generates a response signal Y representing the response voice Vy of the prosody Py corresponding to the value of the prosody Px of the speech voice Vx itself. That is, while the first embodiment controls the prosody Py of the response voice Vy according to a relative value of the prosody Px (that is, the prosody change index Dx), the second embodiment controls the prosody Py of the response voice Vy according to a single value of the prosody Px.
  • the response generation unit 36 generates the response signal Y by adjusting the prosody Pz of the audio signal Z stored in the storage device 22 to the prosody Py.
  • the prosody Py is a feature of the same type as the prosody Px, but the numerical values are different.
  • prosody Px and the prosody Py in the second embodiment are the same as in the first embodiment.
  • pitch, volume, speech speed, spectrum width, pitch width, volume width, speech interval and speech length are preferred examples of prosody Px and prosody Py.
  • An index of the tendency of temporal change of a prosody such as pitch or volume (for example, a change rate such as a rate of increase or decrease) may also be adopted as the prosody Px and the prosody Py.
  • FIG. 7 is a flowchart of processing executed by the control device 20 according to the second embodiment.
  • the process of FIG. 7 is started in response to an instruction from the user U (for example, an instruction to start a program for voice dialogue) to the voice dialogue apparatus 100.
  • the speech analysis unit 34 analyzes the speech signal X generated by the speech input device 24 to specify the prosody Px for one pronunciation period of the speech speech Vx (Sb1).
  • the response generation unit 36 generates a response signal Y of the prosody Py according to the prosody Px (Sb2). Specifically, the response generation unit 36 generates the response signal Y representing the response speech Vy of the prosody Py by changing the prosody Pz of the speech signal Z according to the prosody Py. Then, the response generation unit 36 reproduces the response voice Vy by supplying the response signal Y generated by the above processing to the reproduction device 26 (Sb3).
  • The control device 20 determines whether the end of the voice dialogue has been instructed by the user U (Sb4).
  • When the end of the voice dialogue has not been instructed, the process returns to step Sb1. That is, the specification of the prosody Px of the speech voice Vx (Sb1), the generation of the response signal Y of the prosody Py according to the prosody Px (Sb2), and the reproduction of the response voice Vy (Sb3) are repeated for every pronunciation period Tx. Therefore, as in the first embodiment, a voice dialogue is realized in which the utterance of an arbitrary speech voice Vx by the user U and the reproduction of the response voice Vy to that speech voice Vx alternate.
  • As described above, in the second embodiment, a response signal Y representing the response voice Vy of the prosody Py corresponding to the prosody Px of the speech voice Vx is generated. Therefore, it is possible to realize a natural voice dialogue that simulates the tendency of real dialogue, in which the prosody of the response voice of the dialogue partner is linked to the prosody of the utterance.
  • the response signal Y representing the response voice Vy of the prosody Py is generated by changing the prosody Pz of the speech signal Z by the change amount Dy_n corresponding to the prosody change index Dx_n.
  • The prosody Py_n of the current response voice Vy may be set according to the prosody change index Dx_n and the prosody Py_n−1 of the immediately preceding response voice Vy.
  • In that case, the response generation unit 36 sets, as the prosody Py_n, a value obtained by changing the prosody Py_n−1 according to the prosody change index Dx_n; for example, a value obtained by adding the prosody change index Dx_n to the prosody Py_n−1 is set as the prosody Py_n. With this configuration as well, it is possible to generate a response signal Y representing the response voice Vy of the prosody Py according to the change of the prosody Px of the speech voice Vx.
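A minimal sketch of this variant: the current response prosody is derived from the previous response prosody rather than from the stored prosody Pz.

```python
def next_response_prosody(py_prev, dx_n):
    """Py_n obtained by changing Py_n-1 by the prosody change index Dx_n."""
    return py_prev + dx_n

# Example: the previous response was at 220.0 (e.g. Hz) and the user's prosody
# rose by 8.0 between utterances, so the next response rises with it.
py_n = next_response_prosody(220.0, 8.0)   # 228.0
```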
  • In each of the above embodiments, the prosody Py of the response voice Vy that is controlled is of the same type as the prosody Px of the speech voice Vx, but the prosody Px and the prosody Py may be feature amounts of different types.
  • For example, the volume (prosody Py) of the response voice Vy may be controlled according to the change of the pitch (prosody Px) of the speech voice Vx.
  • In each of the above embodiments, one type of prosody Py of the response voice Vy is controlled according to the prosody Px of the speech voice Vx, but plural types of prosody Py of the response voice Vy may be controlled according to one type of prosody Px of the speech voice Vx.
  • For example, two or more prosodies Py arbitrarily selected from pitch, volume, speech speed, spectrum width, pitch range, volume width, pronunciation interval, and utterance length are controlled according to one type of prosody Px of the speech voice Vx.
  • the combination (type and total number) of the prosody Py of the response speech Vy controlled according to the prosody Px is arbitrary.
  • the prosody Py of the response speech Vy may be controlled according to a plurality of prosody Px of the speech speech Vx.
  • For example, two or more prosodies Px arbitrarily selected from pitch, volume, speech speed, spectrum width, pitch range, volume width, pronunciation interval, and utterance length are specified from the speech voice Vx and used to control one type of prosody Py of the response voice Vy.
  • Plural types of prosody Py may be controlled according to plural types of prosody Px.
  • the combination (type and total number) of the prosody Px of the speech voice Vx applied to the control of the prosody Py of the response speech Vy is arbitrary.
  • In each of the above embodiments, the prosody Py of the response voice Vy is controlled according to the prosody Px of the speech voice Vx, but elements other than the prosody Px of the speech voice Vx may also be applied to the control of the prosody Py of the response voice Vy.
  • the prosody Py of the response voice Vy may be controlled according to the prosody Px of the speech voice Vx and the correction value (offset) set independently of the prosody Px.
  • the final prosody Py is calculated by adding the correction value to the provisional value set according to the prosody Px.
  • the correction value may be either a fixed value or a variable value.
  • the correction value may be decreased as the time of speech dialogue using the speech dialogue apparatus 100 is longer.
  • The prosody Py of the response voice Vy may be limited to a predetermined range. For example, when the provisional prosody value calculated according to the prosody Px of the speech voice Vx exceeds (or falls below) a predetermined threshold, the threshold is adopted as the prosody Py. With this configuration, it is possible to reduce the possibility that the prosody Py of the response voice Vy takes an abnormal value and the voice dialogue becomes unnatural. Alternatively, when the provisional prosody value calculated according to the prosody Px of the speech voice Vx exceeds (or falls below) a predetermined threshold, a response voice Vy representing a question back (re-asking) to the utterance may be generated.
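A sketch of the two range-limiting behaviours just described; the thresholds and the choice of a "pardon?" signal are illustrative assumptions.

```python
def limit_prosody(py_provisional, lo, hi):
    """Clamp the provisional Py to [lo, hi]; the crossed threshold becomes Py."""
    return min(max(py_provisional, lo), hi)

def choose_response(py_provisional, lo, hi, normal_signal, pardon_signal):
    """Variant: an out-of-range provisional Py triggers a re-asking response."""
    if py_provisional < lo or py_provisional > hi:
        return pardon_signal   # e.g. a stored "pardon?" voice
    return normal_signal
```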
  • the difference between the prosody Px_n in the pronunciation period Tx_n of the speech Vx and the prosody Px_n-1 in the immediately preceding pronunciation period Tx_n-1 is calculated as the prosody change index Dx_n.
  • the reference numerical value is not limited to the prosody Px_n-1 of the immediately preceding pronunciation period Tx_n-1.
  • a change in prosody Px_n with respect to the prosody Px in a sound generation period Tx (for example, two or more previous sound generation periods Tx) other than the last sound generation period Tx_n-1 may be calculated as the prosody change index Dx_n.
  • the prosody change index Dx_n may be calculated according to the change in the prosody Px over three or more pronunciation periods Tx.
  • the prosody change index Dx_n may be calculated according to a change in the current prosody Px_n with respect to a representative value (for example, an average value) of the prosody Px over a plurality of pronunciation periods Tx in the past.
  • the difference between the prosody Px_n related to the speech Vx and the prosody Px_n-1 is calculated as the prosody change index Dx_n, but the calculation method of the prosody change index Dx_n is not limited to the above example.
  • In each of the above embodiments, the prosody Py of the response voice Vy is controlled according to the difference (the prosody change index Dx_n) between the prosody Px_n of the pronunciation period Tx_n of the speech voice Vx and the prosody Px_n−1 of the immediately preceding pronunciation period Tx_n−1, but the variable reflected in the prosody Py of the response voice Vy is not limited to the prosody change index Dx_n.
  • the prosody Py_n of the current response speech Vy may be set according to the prosody change index Dx_n and the prosody Py_n-1 of the immediately preceding response speech Vy.
  • The prosody difference between a plurality of past response voices Vy (Py_n−2 − Py_n−1) may also be applied to the setting of the prosody Py_n of the response voice Vy together with the prosody change index Dx_n.
  • In each of the above embodiments, the response signal Y is generated from the voice signal Z stored in the storage device 22 and reproduced, but the response signal Y representing a response voice Vy of specific utterance content may also be synthesized by known speech synthesis technology.
  • For example, concatenative speech synthesis using speech segments, or speech synthesis using a statistical model such as a hidden Markov model, is preferably used.
  • the speech voice Vx and the response speech Vy are not limited to human speech. For example, it is also possible to use the vocalization of an animal as the utterance voice Vx and the response voice Vy.
  • Each of the above embodiments exemplifies a configuration in which the voice interaction device 100 includes the voice input device 24 and the reproduction device 26, but the voice input device 24 and the reproduction device 26 may also be installed in a voice input/output device separate from the voice interaction device 100.
  • the voice interaction device 100 is realized by a terminal device such as a mobile phone or a smart phone, for example, and the voice input / output device is realized by an electronic device such as an animal type toy or a robot.
  • the voice interaction device 100 and the voice input / output device can communicate wirelessly or by wire.
  • the speech signal X generated by the voice input device 24 of the voice input / output device is transmitted to the voice interaction device 100 wirelessly or by wire, and the response signal Y generated by the voice interaction device 100 is wirelessly or by wire It is sent to the playback device 26.
  • In each of the above embodiments, the voice interaction device 100 is realized by an information processing device such as a mobile phone or a personal computer, but part or all of the functions of the voice interaction device 100 may also be realized by a server device (a so-called cloud server). Specifically, the voice interaction device 100 is realized by a server device that communicates with a terminal device via a communication network such as a mobile communication network or the Internet. For example, the voice interaction device 100 receives the speech signal X generated by the voice input device 24 of the terminal device, and generates a response signal Y from the speech signal X by the configuration of each embodiment described above.
  • the voice interaction device 100 transmits the response signal Y generated from the speech signal X to the terminal device, and causes the reproduction device 26 of the terminal device to reproduce the response voice Vy.
  • the voice interaction device 100 is realized by a single device or a set of a plurality of devices (ie, a server system). Whether each function realized by the voice interaction device 100 is realized by the server device or the terminal device (allocation of functions) is optional.
  • In each of the above embodiments, a response voice Vy of specific utterance content (for example, a backchannel such as "un") is reproduced in response to the speech voice Vx, but the utterance content of the response voice Vy is not limited to this example.
  • For example, the utterance content of the speech voice Vx may be analyzed by speech recognition and morphological analysis of the speech signal X, and a response voice Vy with content appropriate to that utterance content may be selected from a plurality of candidates or synthesized, and reproduced by the reproduction device 26.
  • In each of the above embodiments, a response voice Vy of utterance content prepared in advance, independent of the speech voice Vx, is reproduced. At first glance one might conclude that natural dialogue cannot be established this way, but because the prosody of the response voice Vy is controlled in various ways as in the above embodiments, the user U can in fact perceive the feel of a natural dialogue such as occurs between humans. Furthermore, a configuration that performs neither speech recognition nor morphological analysis has the advantage that the processing delay and processing load of those processes can be reduced or eliminated.
  • the response signal Y of the response voice Vy is generated by adjusting the prosody Pz of the voice signal Z, but the method of generating the response signal Y is not limited to the above example.
  • For example, a plurality of voice signals Z with different prosody Pz may be stored in the storage device 22, and the voice signal Z whose prosody Pz is closest to the prosody value corresponding to the prosody change index Dx (hereinafter the "target value") may be selected from the plurality of voice signals Z and used to generate the response signal Y.
  • the response signal Y may be generated from two or more audio signals Z selected in the order in which the prosody Pz is close to the target value among the plurality of audio signals Z.
  • the response signal Y is generated by weighted sum or interpolation of two or more audio signals Z.
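A sketch of this modification, assuming the stored signals are NumPy arrays at the same sample rate: select the stored signal whose prosody Pz is nearest the target value, or blend the two nearest by a weighted sum (the distance-based weights are an illustrative choice).

```python
import numpy as np

def select_signal(signals, prosodies, target):
    """Pick the stored signal Z whose prosody Pz is closest to the target value."""
    i = int(np.argmin([abs(p - target) for p in prosodies]))
    return signals[i]

def interpolate_two(signals, prosodies, target):
    """Weighted sum of the two signals whose Pz are closest to the target
    (requires at least two stored signals)."""
    order = np.argsort([abs(p - target) for p in prosodies])[:2]
    i, j = int(order[0]), int(order[1])
    d_i, d_j = abs(prosodies[i] - target), abs(prosodies[j] - target)
    w_i = d_j / (d_i + d_j) if (d_i + d_j) > 0 else 1.0
    n = min(len(signals[i]), len(signals[j]))   # align lengths before summing
    return w_i * signals[i][:n] + (1.0 - w_i) * signals[j][:n]
```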
  • It is also possible to use the voice interaction device 100 exemplified in each of the above embodiments for the evaluation of actual human dialogue. For example, by comparing the prosody of a response voice observed in actual human dialogue (hereinafter "observed voice") with the prosody of the response voice Vy generated as in the above embodiments, the observed voice can be evaluated as appropriate when the two prosodies are similar, and as inappropriate when the two prosodies diverge.
  • the apparatus (interaction evaluation apparatus) which performs the evaluation illustrated above may be used for training of human dialogue.
  • the voice interaction device 100 exemplified in each of the above-described embodiments is realized by the cooperation of the control device 20 and the program for voice interaction as described above.
  • A program according to a first aspect of the present invention (for example, the first embodiment) causes a computer to execute a voice analysis process (Sa1) of specifying the feature amount of a first voice represented by a first voice signal for each pronunciation period, and a response generation process (Sa2 and Sa3) of generating a second voice signal representing a second voice whose feature amount corresponds to the change of the feature amount of the first voice over a plurality of pronunciation periods.
  • A program according to a second aspect of the present invention (for example, the second embodiment) causes a computer to execute a voice analysis process (Sb1) of specifying the feature amount of a first voice represented by a first voice signal, and a response generation process (Sb2) of generating a second voice signal representing a second voice whose feature amount corresponds to the feature amount of the first voice.
  • the program according to each of the above aspects is provided in a form stored in a computer readable recording medium and installed in the computer.
  • The recording medium is, for example, a non-transitory recording medium, preferably an optical recording medium (optical disc) such as a CD-ROM, but may be any known recording medium such as a semiconductor recording medium or a magnetic recording medium. Note that a "non-transitory recording medium" includes every computer-readable recording medium except a transient propagating signal, and does not exclude volatile recording media.
  • the program may be distributed to the computer in the form of distribution via a communication network.
  • In a preferred aspect, the computer specifies the feature amount of a first voice represented by a first voice signal for each pronunciation period, and generates a second voice signal representing a second voice whose feature amount corresponds to the change of the feature amount of the first voice over a plurality of pronunciation periods.
  • the second audio signal representing the second audio of the feature amount according to the change of the feature amount of the first audio is generated. Therefore, for example, it is possible to realize a natural voice dialogue that simulates the tendency of a real dialogue in which the feature quantity of the response voice of the dialogue partner is interlocked with the change of the feature quantity of the uttered voice.
  • In a preferred aspect, a voice processing device includes a voice analysis unit that specifies the feature amount of a first voice represented by a first voice signal for each pronunciation period, and a response generation unit that generates a second voice signal representing a second voice whose feature amount corresponds to the change of the feature amount of the first voice over a plurality of pronunciation periods.
  • the second audio signal representing the second audio of the feature amount according to the change of the feature amount of the first audio is generated. Therefore, for example, it is possible to realize a natural voice dialogue that simulates the tendency of a real dialogue in which the feature quantity of the response voice of the dialogue partner is interlocked with the change of the feature quantity of the uttered voice.
  • In a preferred aspect, the feature amount of the first voice, and the feature amount of the second voice corresponding to its change, are at least one of pitch, volume, speech speed, spectrum width (the fluctuation amount of the spectral envelope), the fluctuation range of pitch within a pronunciation period, the fluctuation range of volume within a pronunciation period, the interval between successive pronunciation periods, and the time length of a pronunciation period.
  • 100 ... voice interaction device, 20 ... control device, 22 ... storage device, 24 ... voice input device, 242 ... sound collection device, 244 ... A/D converter, 26 ... reproduction device, 262 ... D/A converter, 264 ... sound emitting device, 32 ... voice acquisition unit, 34 ... voice analysis unit, 36 ... response generation unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A speech processing device includes a speech analysis unit for identifying a feature amount of a first speech represented by a first speech signal for each utterance period, and a response generation unit for generating a second speech signal representing a second speech with a feature amount corresponding to a change in the feature amount of the first speech during a plurality of utterance periods.

Description

Voice processing method and voice processing device
The present invention relates to a technique suitable for voice dialogue.
Techniques for voice dialogue that realize a dialogue with a user by reproducing the voice of a response to the user's utterance (for example, an answer to a question) have conventionally been proposed. For example, Patent Document 1 discloses a technique for analyzing utterance content by speech recognition of the user's utterance voice, and synthesizing and reproducing a response voice according to the analysis result.
JP 2012-128440 A
However, with existing techniques, including that of Patent Document 1, it is in practice difficult to realize a natural voice dialogue that faithfully reflects the tendencies of real dialogue between humans, and there is a problem that the user may receive a mechanical and unnatural impression. In view of the above circumstances, the present invention aims to realize a natural voice dialogue.
To solve the above problem, a voice processing method according to a preferred aspect of the present invention specifies the feature amount of a first voice represented by a first voice signal for each pronunciation period, and generates a second voice signal representing a second voice whose feature amount corresponds to the change of the feature amount of the first voice over a plurality of pronunciation periods.
A voice processing device according to a preferred aspect of the present invention includes a voice analysis unit that specifies the feature amount of a first voice represented by a first voice signal for each pronunciation period, and a response generation unit that generates a second voice signal representing a second voice whose feature amount corresponds to the change of the feature amount of the first voice over a plurality of pronunciation periods.
FIG. 1 is a block diagram showing the configuration of the voice interaction device according to the first embodiment.
FIG. 2 is a flowchart showing the operation of the voice interaction device.
FIG. 3 is an explanatory diagram showing the relationship between the prosody of the speech voice and the prosody of the response voice.
FIG. 4 is a graph showing the relationship between the prosody change index and the change amount of the prosody of the response voice.
FIG. 5 is an explanatory diagram of the prosody of the response voice when the pronunciation interval is used as the prosody.
FIG. 6 is an explanatory diagram of the prosody of the response voice when the utterance length is used as the prosody.
FIG. 7 is a flowchart showing the operation of the voice interaction device in the second embodiment.
<First Embodiment>
FIG. 1 is a block diagram of the voice interaction device 100 according to the first embodiment of the present invention. The voice interaction device 100 of the first embodiment is a computer system that reproduces a voice Vy (hereinafter "response voice") that responds to an input voice Vx (hereinafter "speech voice") uttered by the user U. For example, a portable information processing device such as a mobile phone or a smartphone, or an information processing device such as a personal computer, is used as the voice interaction device 100. The voice interaction device 100 can also be realized in the form of a toy imitating the appearance of an animal or the like (for example, a doll such as a stuffed animal) or a robot.
The speech voice Vx is, for example, the voice of an utterance including a question or a remark, and the response voice Vy is the voice of a response including an answer to the question or an acknowledgment of the remark. The response voice Vy also includes, for example, voices representing interjections. An interjection is an uninflected independent word (exclamation) used independently of other segments. Specific examples include words expressing a backchannel to an utterance, such as "un" and "ee" ("aha" or "right" in English); words expressing hesitation (a stall in responding), such as "eto" and "ano" ("um" or "er" in English); words expressing an answer (affirmation or negation of a question), such as "hai" and "iie" ("yes" or "no" in English); words expressing the speaker's emotion, such as "aa" and "oo" ("ah" or "woo" in English); and words meaning a question back (re-asking) about an utterance, such as "e?" and "nani?" ("pardon?" or "sorry?" in English).
The voice interaction device 100 of the first embodiment is a voice processing device that generates a response voice Vy (an example of a second voice) whose feature amount corresponds to the feature amount of the speech voice Vx (an example of a first voice). The feature amount is, for example, prosody. Prosody is a linguistic and phonetic characteristic that a listener of the voice can perceive, and means a property that cannot be grasped from the general written representation of the language alone (for example, a representation excluding special notation for prosody). Prosody can also be described as a characteristic that allows the listener to recall or infer the speaker's intention or emotion. Specifically, various features such as intonation (change or inflection of the tone of the voice), tone (the height or intensity of the voice), sound length (utterance length), speech speed, rhythm (the structure of temporal change in tone), and accent (pitch or stress accent) are included in the concept of prosody, and typical examples of prosody are pitch (fundamental frequency) and volume.
As illustrated in FIG. 1, the voice interaction device 100 of the first embodiment includes a control device 20, a storage device 22, a voice input device 24, and a reproduction device 26. The voice input device 24 is an element that generates a voice signal X (hereinafter "speech signal") representing the speech voice Vx of the user U, and includes a sound collection device 242 and an A/D converter 244. The sound collection device 242 picks up the speech voice Vx uttered by the user U and generates an analog voice signal representing the sound pressure fluctuation of the speech voice Vx. The A/D converter 244 converts the voice signal generated by the sound collection device 242 into a digital speech signal X.
The control device 20 is an arithmetic processing unit (for example, a CPU) that comprehensively controls each element of the voice interaction device 100. The control device 20 of the first embodiment acquires the speech signal X supplied from the voice input device 24 and generates a response signal Y (an example of a second voice signal) representing a response voice Vy to the speech voice Vx. The reproduction device 26 is an element that reproduces the response voice Vy according to the response signal Y generated by the control device 20, and includes a D/A converter 262 and a sound emitting device 264. The D/A converter 262 converts the digital response signal Y generated by the control device 20 into an analog voice signal, and the sound emitting device 264 (for example, a speaker or headphones) emits the response voice Vy as sound waves according to the converted voice signal. The reproduction device 26 also includes a processing circuit such as an amplifier that amplifies the response signal Y. The speech signal X and the response signal Y are, for example, voice data in wav format.
The storage device 22 stores a program executed by the control device 20 and various data used by the control device 20. For example, a known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of a plurality of recording media, may be arbitrarily adopted as the storage device 22. The storage device 22 of the first embodiment stores a voice signal Z representing a response voice of specific utterance content. The following description exemplifies a case where a voice signal Z of a response voice such as "un", which represents a backchannel and is an example of an interjection, is stored in the storage device 22. The voice signal Z is recorded in advance and stored in the storage device 22 as voice data in an arbitrary format such as wav.
The control device 20 implements a plurality of functions (a voice analysis unit 34 and a response generation unit 36) for establishing a voice dialogue with the user U by executing the program stored in the storage device 22. A configuration in which the functions of the control device 20 are realized by a plurality of devices (that is, a system), or a configuration in which a dedicated electronic circuit realizes part of the functions of the control device 20, may also be adopted.
The voice analysis unit 34 specifies the prosody Px of the speech voice Vx from the speech signal X generated by the voice input device 24. The prosody Px is an acoustic feature amount that can be extracted from the speech signal X. The voice analysis unit 34 of the first embodiment sequentially specifies the prosody Px for each pronunciation period of the speech voice Vx. As described above, plural types of prosody can be specified for the speech voice Vx, and the voice analysis unit 34 specifies the value of the particular type of prosody Px that the program being executed requires. One pronunciation period is a continuous period grasped as one utterance by the user U (for example, a question or a remark), for example a period during which the volume of the speech voice Vx continuously exceeds a predetermined threshold. A speech period corresponding to one response may also be defined as a pronunciation period. Specifically, the voice analysis unit 34 specifies a representative value (for example, an average) of a plurality of prosody values computed at a predetermined cycle within a pronunciation period as the prosody Px of that pronunciation period. The prosody at a specific point (for example, the end point) of a pronunciation period may also be specified as the prosody Px of that period, and the prosody Px may be specified from the point immediately before the last phoneme of the speech voice Vx in the pronunciation period.
The response generation unit 36 generates a response signal Y representing the response voice Vy. Specifically, the response generation unit 36 generates a response signal Y representing a response voice Vy whose prosody Py accords with the temporal change of the prosody Px specified by the voice analysis unit 34. The change of the prosody Px is an example of a "change of the feature quantity". Since the prosody Px is specified for each pronunciation period as described above, the temporal change of the prosody Px means the change of the prosody Px between successive pronunciation periods, not a change of prosody within one pronunciation period. The prosody Py is a feature quantity of the same kind as the prosody Px, but its numerical value differs. The response generation unit 36 of the first embodiment generates the response signal Y by adjusting the prosody Pz of the audio signal Z stored in the storage device 22 to the prosody Py. The response signal Y generated by the response generation unit 36 is supplied to the playback device 26, whereby the response voice Vy is played back. That is, the response voice Vy, obtained by adjusting the initial response voice represented by the audio signal Z in accordance with the prosody Px of the utterance voice Vx, is played back from the playback device 26.
FIG. 2 is a flowchart of the processing executed by the control device 20 of the first embodiment. The processing of FIG. 2 is started, for example, in response to an instruction from the user U to the voice dialogue apparatus 100 (for example, an instruction to start the voice dialogue program). When the processing of FIG. 2 starts, the voice analysis unit 34 analyzes the utterance signal X generated by the voice input device 24, thereby specifying the prosody Px for one pronunciation period Tx of the utterance voice Vx (Sa1). The numerical value of the prosody Px is basically fixed at the end of the pronunciation period Tx, but it may instead be fixed at a point partway through the pronunciation period Tx. FIG. 3 shows the prosody Px_n calculated for the n-th pronunciation period Tx_n of the utterance voice Vx (n is a natural number). That is, FIG. 3 illustrates the processing executed at the stage where the user U's utterance in the pronunciation period Tx_n (for example, a question or a remark) has been completed.
The response generation unit 36 calculates an index Dx of the change of the prosody Px of the utterance voice Vx (hereinafter "prosody change index") (Sa2). Specifically, as illustrated in FIG. 3, the response generation unit 36 calculates, as the prosody change index Dx_n, the difference between the prosody Px_n calculated for the latest pronunciation period Tx_n of the utterance voice Vx and the prosody Px_n-1 calculated for the immediately preceding pronunciation period Tx_n-1 (Dx_n = Px_n - Px_n-1). That is, the prosody change index Dx_n is an index of the difference in prosody between two successive utterance voices Vx (the change in prosody between two successive utterances).
The response generation unit 36 generates a response signal Y of the prosody Py corresponding to the prosody change index Dx_n (Sa3). Specifically, as illustrated in FIG. 3, the response generation unit 36 generates the response signal Y representing the response voice Vy of the prosody Py by changing the prosody Pz of the audio signal Z by a change amount Dy_n corresponding to the prosody change index Dx_n. At the stage where the utterance voice Vx of the first pronunciation period Tx_1 has been pronounced, the difference of the prosody Px cannot yet be calculated between two successive pronunciation periods Tx, so the change amount Dy_1 is set to a predetermined initial value. The prosody change index Dx_n-1 is calculated by the same procedure as described above, from the difference between the prosody Px_n-1 calculated for the pronunciation period Tx_n-1 of the utterance voice Vx and the prosody Px_n-2 calculated for the immediately preceding pronunciation period Tx_n-2.
FIG. 4 is a graph showing the relation between the prosody change index Dx and the change amount Dy (the difference between the prosody Pz and the prosody Py). The graph of FIG. 4 corresponds to a rule for determining the change amount Dy from the prosody change index Dx. As exemplified by the solid line in FIG. 4, the change amount Dy is determined so that it increases linearly with an increase of the prosody change index Dx; for example, the change amount Dy is set to a value equal to the prosody change index Dx. Accordingly, when the prosody Px_n exceeds the prosody Px_n-1 (that is, when the prosody Px of the utterance voice Vx has increased), the prosody Py of the response voice Vy is set to a value exceeding the prosody Pz of the audio signal Z. Conversely, when the prosody Px_n falls below the prosody Px_n-1 (that is, when the prosody Px of the utterance voice Vx has decreased), the prosody Py of the response voice Vy is set to a value below the prosody Pz of the audio signal Z. The relation between the prosody change index Dx and the change amount Dy is not limited to this example. For instance, as exemplified by the broken line in FIG. 4, the change amount Dy may vary non-linearly with the prosody change index Dx, or the sum of the prosody change index Dx_n and an initial value may be calculated as the change amount Dy_n. That is, any relation between the prosody change index Dx and the change amount Dy suffices as long as the prosody Py of the response voice Vy becomes a prosody suited to the prosody Px of the utterance voice Vx.
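As a sketch of this rule (not a definitive implementation): the linear case below is the solid line of FIG. 4 with Dy equal to Dx, while the tanh curve is one hypothetical shape for the broken-line non-linear variant, whose exact form the text leaves open.

    import math

    def change_amount(dx, rule="linear", scale=1.0):
        # Fig. 4, solid line: Dy equal to Dx; the tanh curve is a hypothetical
        # stand-in for the broken-line non-linear mapping
        if rule == "linear":
            return dx
        return scale * math.tanh(dx / scale)

    def response_prosody(pz, px_prev, px_curr, rule="linear"):
        dx = px_curr - px_prev                 # prosody change index Dx_n
        return pz + change_amount(dx, rule)    # Py_n = Pz shifted by Dy_n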
As understood from the above description, the change amount Dy, which represents the degree to which the prosody Py of the response voice Vy is changed, is set under the predetermined rule exemplified in FIG. 4. That is, the change amount Dy for adjusting the prosody Py of the response voice Vy to be output immediately afterwards is set from the prosody change index Dx indicating the change of the prosody Px between successive utterance voices Vx. The prosody Py of the response voice Vy set by this method is the result of adjusting the prosody Pz of the audio signal Z so as to harmonize with utterances such as questions and remarks.
The response generation unit 36 plays back the response voice Vy by supplying the response signal Y generated by the above processing to the playback device 26 (Sa4). When playback of the response voice Vy is completed, the control device 20 determines whether the user U has instructed the end of the voice dialogue (Sa5). If the end of the voice dialogue has not been instructed (Sa5: NO), the control device 20 returns the processing to step Sa1. As understood from the above description, the specification of the prosody Px of the utterance voice Vx (Sa1), the calculation of the prosody change index Dx (Sa2), the generation of the response signal Y of the prosody Py corresponding to the prosody change index Dx (Sa3), and the playback of the response voice Vy (Sa4) are repeated for every pronunciation period Tx of the utterance voice Vx. That is, the processing from step Sa1 to step Sa4 is executed each time the user U pronounces an utterance voice Vx (each time an utterance signal X is input). A voice dialogue is thus realized in which the pronunciation of an arbitrary utterance voice Vx by the user U and the playback of the response voice Vy to that utterance voice Vx alternate repeatedly. The processing from step Sa1 to step Sa4 is executed sequentially, for each pronunciation period Tx, in response to the user U's utterance (input), and corresponds to the operation of generating a response to one utterance voice Vx.
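A minimal control-flow sketch of steps Sa1 through Sa5, under the assumption that every callable below (get_utterance, analyze, synthesize, play, end_requested) is a hypothetical stand-in for the corresponding device or unit (the voice input device 24, the units 34 and 36, and the playback device 26):

    def dialogue_loop(get_utterance, analyze, synthesize, play, end_requested,
                      initial_dy=0.0):
        # repeats Sa1-Sa4 for every pronunciation period until Sa5 ends it
        px_prev = None
        while True:
            x = get_utterance()                      # one pronunciation period Tx
            px = analyze(x)                          # Sa1: prosody Px
            # Sa2: change index Dx (the initial value stands in for Dy_1)
            dx = initial_dy if px_prev is None else px - px_prev
            play(synthesize(dx))                     # Sa3-Sa4: generate and play Vy
            px_prev = px
            if end_requested():                      # Sa5
                return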
As described above, in the first embodiment, the response signal Y representing the response voice Vy of the prosody Py corresponding to the temporal change of the prosody Px of the utterance voice Vx is generated. That is, the prosody Py of the response voice Vy changes in conjunction with the prosody Px of the utterance voice Vx. It is therefore possible to realize a natural voice dialogue simulating the tendency of real dialogue, in which the prosody of the dialogue partner's response voice tracks changes in the prosody of the uttered voice.
<Specific examples of the prosody Px and the prosody Py>
Specific examples of the prosody Px and the prosody Py in the first embodiment are described below; a sketch computing several of these features, including the interval scheduling of example (7), follows example (8).
(1) A first example of the prosody Px and the prosody Py is pitch (fundamental frequency). When the user U raises the pitch of the utterance voice Vx over time (that is, between two successive pronunciation periods Tx), the pitch of the response voice Vy to each utterance voice Vx rises in conjunction with that rise.
(2) A second example of the prosody Px and the prosody Py is volume. When the user U increases the volume of the utterance voice Vx over time, the volume of the response voice Vy increases in conjunction with that increase.
(3) A third example of the prosody Px and the prosody Py is speech rate. The speech rate means the speed of utterance; for example, the number of phonemes contained in the voice per unit time corresponds to the speech rate. When the user U raises the speech rate of the utterance voice Vx over time, the speech rate of the response voice Vy rises in conjunction with that rise.
(4) A fourth example of the prosody Px and the prosody Py is spectrum width. The spectrum width is, for example, the difference between the maximum and minimum values of the envelope (spectral envelope) of the frequency spectrum of the voice. When the user U pronounces so that the spectrum width of the utterance voice Vx increases over time, the spectrum width of the response voice Vy increases in conjunction with that increase.
(5) A fifth example of the prosody Px and the prosody Py is pitch range. The pitch range is the fluctuation range of the pitch within a pronunciation period (that is, the difference between the maximum and minimum pitch within the pronunciation period). When the user U increases the pitch range of the utterance voice Vx over time, the pitch range of the response voice Vy increases in conjunction with that increase.
(6) A sixth example of the prosody Px and the prosody Py is volume range. The volume range is the fluctuation range of the volume within a pronunciation period (that is, the difference between the maximum and minimum volume within the pronunciation period). When the user U increases the volume range of the utterance voice Vx over time, the volume range of the response voice Vy increases over time in conjunction with that increase. The pitch range and the volume range correspond to the intonation (tone) of the voice; in the fifth and sixth examples, therefore, the intonation of the response voice Vy changes in conjunction with changes of intonation in the utterance voice Vx.
(7) A seventh example of the prosody Px and the prosody Py is the utterance interval. The utterance interval is the interval between two successive pronunciation periods in the voice dialogue (the time length from the end point of the earlier pronunciation period to the start point of the later pronunciation period). In the first embodiment, the interval between a pronunciation period Tx of the utterance voice Vx and a pronunciation period Ty of the response voice Vy corresponds to this interval.
For example, as illustrated in FIG. 5, assume that the interval between the (n-2)-th pronunciation period Ty_n-2 of the response voice Vy and the (n-1)-th pronunciation period Tx_n-1 of the utterance voice Vx is specified as the prosody Px_n-1, and that the interval between the (n-1)-th pronunciation period Ty_n-1 of the response voice Vy and the n-th pronunciation period Tx_n of the utterance voice Vx is specified as the prosody Px_n. The prosody change index Dx_n is then calculated as the time length corresponding to the difference between the prosody Px_n and the prosody Px_n-1.
The response generation unit 36 generates the response signal Y so that the pronunciation period Ty_n of the response voice Vy starts when the change amount Dy_n corresponding to the prosody change index Dx_n has elapsed from the end point of the pronunciation period Tx_n. That is, the change amount Dy_n is applied as the prosody Py_n (utterance interval) of the response voice Vy. The change amount Dy_n may also be calculated from the prosody change index Dx_n (that is, the difference between the prosody Px_n and the prosody Px_n-1) together with a predetermined initial value; for example, the sum of the prosody change index Dx_n and the initial value may be calculated as the change amount Dy_n. As understood from the above description, even in the configuration in which the prosody Px and the prosody Py are utterance intervals, a response signal Y representing a response voice Vy of the prosody Py corresponding to the change of the prosody Px of the utterance voice Vx (the prosody change index Dx_n) is generated.
Although FIG. 5 focuses on the utterance interval between the pronunciation period Tx_n and the pronunciation period Ty_n, the utterance interval between the pronunciation period Tx_n-1 and the pronunciation period Ty_n-1 in FIG. 5 is set according to a prosody change index Dx_n-1 obtained by the same procedure as described above. At the beginning of the voice dialogue, at the stage where the difference of the prosody Px cannot yet be calculated between two successive pronunciation periods Tx, the change amount Dy is set to a predetermined initial value.
(8) An eighth example of the prosody Px and the prosody Py is the time length of the pronunciation period (hereinafter "utterance length"). The utterance length is the time from the start point to the end point of the pronunciation period. Specifically, as illustrated in FIG. 6, assume that the time length of the (n-1)-th pronunciation period Tx_n-1 of the utterance voice Vx is specified as the prosody Px_n-1 and the time length of the n-th pronunciation period Tx_n is specified as the prosody Px_n. The prosody change index Dx_n is calculated as the time length corresponding to the difference between the prosody Px_n and the prosody Px_n-1. The prosody change index Dx_n-1 is calculated by the same procedure, from the difference between the prosody Px_n-1 calculated for the pronunciation period Tx_n-1 of the utterance voice Vx and the prosody Px_n-2 calculated for the immediately preceding pronunciation period Tx_n-2.
The response generation unit 36 generates the response signal Y so that the prosody Py_n (that is, the utterance length) of the response voice Vy to the utterance voice Vx of the pronunciation period Tx_n becomes the time length (change amount Dy_n) corresponding to the prosody change index Dx_n. That is, the change amount Dy_n is applied as the prosody Py_n of the response voice Vy. For example, the sum of the prosody change index Dx_n and an initial value may be calculated as the change amount Dy_n. As understood from the above description, even in the configuration in which the prosody Px and the prosody Py are utterance lengths, a response signal Y representing a response voice Vy of the prosody Py corresponding to the change of the prosody Px of the utterance voice Vx (the prosody change index Dx_n) is generated. At the beginning of the voice dialogue, at the stage where the difference of the prosody Px cannot yet be calculated between two successive pronunciation periods Tx, the change amount Dy is set to a predetermined initial value.
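The sketch promised above follows: one hypothetical way to compute several of the listed prosody types from frame-wise pitch and volume tracks, plus the onset scheduling of example (7). The inputs (f0, rms, hop_sec, n_phonemes) are assumed to come from an analysis front end; example (4), spectrum width, is omitted because it needs the spectral envelope rather than these tracks.

    import numpy as np

    def prosody_features(f0, rms, hop_sec, n_phonemes):
        # f0 and rms are frame-wise pitch and volume tracks of one pronunciation
        # period; hop_sec is the frame hop in seconds; n_phonemes is assumed to
        # be supplied by a separate front end
        dur = len(f0) * hop_sec                     # (8) utterance length
        return {
            "pitch":        float(np.mean(f0)),    # (1)
            "volume":       float(np.mean(rms)),   # (2)
            "speech_rate":  n_phonemes / dur,      # (3) phonemes per second
            "pitch_range":  float(np.ptp(f0)),     # (5) max - min pitch
            "volume_range": float(np.ptp(rms)),    # (6) max - min volume
            "length":       dur,                   # (8)
        }

    def schedule_response_onset(tx_end, px_curr, px_prev, initial_gap=0.6):
        # example (7): the gap before the response period Ty_n is the change
        # amount Dy_n, derived here from the change of the preceding intervals;
        # the clamp to zero and the initial gap value are assumptions
        dy = initial_gap if px_prev is None else px_curr - px_prev
        return tx_end + max(0.0, dy)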
<Second Embodiment>
A second embodiment of the present invention is described below. For elements whose operation or function is the same as in the first embodiment, the reference signs used in the description of the first embodiment are reused and detailed description of each is omitted as appropriate.
The response generation unit 36 of the first embodiment generates the response signal Y representing the response voice Vy of the prosody Py corresponding to the temporal change of the prosody Px of the utterance voice Vx. The response generation unit 36 of the second embodiment generates a response signal Y representing a response voice Vy of the prosody Py corresponding to the numerical value of the prosody Px of the utterance voice Vx. That is, whereas in the first embodiment the prosody Py of the response voice Vy is controlled according to a relative value of the prosody Px (that is, the prosody change index Dx), in the second embodiment the prosody Py of the response voice Vy is controlled according to a single value of the prosody Px. In the second embodiment, as in the first embodiment, the response generation unit 36 generates the response signal Y by adjusting the prosody Pz of the audio signal Z stored in the storage device 22 to the prosody Py, and the prosody Py is a feature quantity of the same kind as the prosody Px with a different numerical value.
The specific examples of the prosody Px and the prosody Py in the second embodiment are the same as in the first embodiment; for example, pitch, volume, speech rate, spectrum width, pitch range, volume range, utterance interval and utterance length are preferred examples of the prosody Px and the prosody Py. An index value indicating the tendency of temporal change of a prosody such as pitch or volume (for example, a rate of change such as a rate of increase or decrease) may also be adopted as the prosody Px and the prosody Py.
FIG. 7 is a flowchart of the processing executed by the control device 20 of the second embodiment. The processing of FIG. 7 is started, for example, in response to an instruction from the user U to the voice dialogue apparatus 100 (for example, an instruction to start the voice dialogue program). When the processing of FIG. 7 starts, the voice analysis unit 34 analyzes the utterance signal X generated by the voice input device 24, thereby specifying the prosody Px for one pronunciation period of the utterance voice Vx (Sb1).
The response generation unit 36 generates a response signal Y of the prosody Py corresponding to the prosody Px (Sb2). Specifically, the response generation unit 36 generates the response signal Y representing the response voice Vy of the prosody Py by changing the prosody Pz of the audio signal Z to the prosody Py. The response generation unit 36 then plays back the response voice Vy by supplying the response signal Y generated by the above processing to the playback device 26 (Sb3).
When playback of the response voice Vy is completed, the control device 20 determines whether the user U has instructed the end of the voice dialogue (Sb4). If the end of the voice dialogue has not been instructed (Sb4: NO), the processing returns to step Sb1. That is, the specification of the prosody Px of the utterance voice Vx (Sb1), the generation of the response signal Y of the prosody Py corresponding to the prosody Px (Sb2), and the playback of the response voice Vy (Sb3) are repeated for every pronunciation period Tx of the utterance voice Vx. As in the first embodiment, a voice dialogue is therefore realized in which the pronunciation of an arbitrary utterance voice Vx by the user U and the playback of the response voice Vy to that utterance voice Vx alternate repeatedly.
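As a sketch of the second embodiment's mapping (the disclosure only requires that Py depend on the value of Px itself; the affine pull toward Px below, and its weight, are hypothetical concrete choices):

    def prosody_from_value(px, pz, weight=0.5):
        # move the stored prosody Pz toward the observed value Px; weight = 0
        # leaves Z unchanged, weight = 1 copies the utterance prosody outright
        return pz + weight * (px - pz)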
As described above, in the second embodiment, the response signal Y representing the response voice Vy of the prosody Py corresponding to the prosody Px of the utterance voice Vx is generated. It is therefore possible to realize a natural voice dialogue simulating the tendency of real dialogue, in which the prosody of the dialogue partner's response voice tracks changes in the prosody of the uttered voice.
<Modifications>
Specific modifications added to each of the aspects exemplified above are described below. Two or more aspects arbitrarily selected from the following examples may be combined as appropriate insofar as they do not contradict each other.
(1) In each of the embodiments described above, the response signal Y representing the response voice Vy of the prosody Py is generated by changing the prosody Pz of the audio signal Z by the change amount Dy_n corresponding to the prosody change index Dx_n, but the prosody Py_n of the current response voice Vy may instead be set according to the prosody change index Dx_n and the prosody Py_n-1 of the immediately preceding response voice Vy. Specifically, the response generation unit 36 sets, as the prosody Py_n, a value obtained by changing the prosody Py_n-1 according to the prosody change index Dx_n; for example, the value obtained by adding the prosody change index Dx_n to the prosody Py_n-1 is set as the prosody Py_n. With this configuration as well, a response signal Y representing a response voice Vy of the prosody Py corresponding to the change of the prosody Px of the utterance voice Vx can be generated.
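This recursive variant reduces to a one-line update; the sketch below simply restates the additive example from the text:

    def next_response_prosody(py_prev, dx):
        # Py_n = Py_{n-1} + Dx_n: shift the previous response prosody by the
        # utterance's change index, instead of shifting the stored prosody Pz
        return py_prev + dx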
(2) In each of the embodiments described above, a prosody Py of the response voice Vy of the same kind is controlled according to the prosody Px of the utterance voice Vx, but the prosody Px of the utterance voice Vx and the prosody Py of the response voice Vy controlled according to that prosody Px may be feature quantities of mutually different kinds. For example, the volume (prosody Py) of the response voice Vy may be controlled according to changes of the pitch (prosody Px) of the utterance voice Vx.
(3) In each of the embodiments described above, one prosody Py of the response voice Vy is controlled according to the prosody Px of the utterance voice Vx, but plural kinds of prosody Py of the response voice Vy may be controlled according to one kind of prosody Px of the utterance voice Vx. For example, two or more prosodies Py arbitrarily selected from pitch, volume, speech rate, spectrum width, pitch range, volume range, utterance interval and utterance length are controlled according to one kind of prosody Px of the utterance voice Vx. The combination (kinds and total number) of prosodies Py of the response voice Vy controlled according to the prosody Px is arbitrary.
The prosody Py of the response voice Vy may also be controlled according to plural kinds of prosody Px of the utterance voice Vx. For example, two or more prosodies Px arbitrarily selected from pitch, volume, speech rate, spectrum width, pitch range, volume range, utterance interval and utterance length are specified from the utterance voice Vx and used to control one kind of prosody Py of the response voice Vy. Plural kinds of prosody Py may also be controlled according to plural kinds of prosody Px. As understood from the above description, the combination (kinds and total number) of prosodies Px of the utterance voice Vx applied to the control of the prosody Py of the response voice Vy is arbitrary.
(4) In each of the embodiments described above, the prosody Py of the response voice Vy is controlled according to the prosody Px of the utterance voice Vx, but elements other than the prosody Px of the utterance voice Vx may also be applied to the control of the prosody Py of the response voice Vy. For example, the prosody Py of the response voice Vy may be controlled according to the prosody Px of the utterance voice Vx and a correction value (offset) set independently of the prosody Px; for example, the final prosody Py is calculated by adding the correction value to a provisional value set according to the prosody Px. The correction value may be either a fixed value or a variable value; for example, the correction value may be decreased as the duration of the voice dialogue using the voice dialogue apparatus 100 grows longer.
(5) The prosody Py of the response voice Vy may be limited to a predetermined range. For example, when the provisional prosody value calculated according to the prosody Px of the utterance voice Vx exceeds (or falls below) a predetermined threshold, that threshold is adopted as the prosody Py. This configuration reduces the possibility that the prosody Py of the response voice Vy takes an abnormal value and the voice dialogue becomes unnatural. Alternatively, when the provisional prosody value calculated according to the prosody Px of the utterance voice Vx exceeds (or falls below) a predetermined threshold, a response voice Vy representing a request to repeat the utterance (asking back) may be generated.
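A sketch of this limiting behavior, assuming the bounds are given; reporting which bound was hit lets the caller substitute the asking-back response instead of the clipped value, as the modification allows:

    def limit_prosody(provisional, lo, hi):
        # clip the provisional value into [lo, hi] and report the crossed bound
        if provisional > hi:
            return hi, "upper"
        if provisional < lo:
            return lo, "lower"
        return provisional, None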
(6) In the first embodiment, the difference between the prosody Px_n of the pronunciation period Tx_n of the utterance voice Vx and the prosody Px_n-1 of the immediately preceding pronunciation period Tx_n-1 is calculated as the prosody change index Dx_n, but the reference value for the change of the prosody Px_n is not limited to the prosody Px_n-1 of the immediately preceding pronunciation period Tx_n-1. For example, the change of the prosody Px_n relative to the prosody Px of a pronunciation period Tx other than the immediately preceding pronunciation period Tx_n-1 (for example, a pronunciation period Tx two or more periods earlier) may be calculated as the prosody change index Dx_n. The prosody change index Dx_n may also be calculated according to changes of the prosody Px over three or more pronunciation periods Tx; for example, it may be calculated according to the change of the current prosody Px_n relative to a representative value (for example, the average) of the prosody Px over a plurality of past pronunciation periods Tx.
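A sketch of the averaged-baseline variant, with the window size k as an assumed parameter:

    def change_index_vs_history(history, px_curr, k=3):
        # baseline = mean of the prosody over the last k pronunciation periods
        if not history:
            return 0.0      # no reference yet; the caller may use an initial value
        recent = history[-k:]
        return px_curr - sum(recent) / len(recent)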
(7) In the first embodiment, the difference between the prosody Px_n and the prosody Px_n-1 of the utterance voice Vx is calculated as the prosody change index Dx_n, but the method of calculating the prosody change index Dx_n is not limited to this example. For instance, the ratio of the prosody Px_n to the prosody Px_n-1 may be calculated as the prosody change index Dx_n (Dx_n = Px_n / Px_n-1). That is, the prosody change index Dx_n is comprehensively expressed as an index corresponding to the change of the prosody Px of the utterance voice Vx.
(8) In each of the embodiments described above, the prosody Py of the response voice Vy is set according to the difference (prosody change index Dx_n) between the prosody Px_n of the pronunciation period Tx_n of the utterance voice Vx and the prosody Px_n-1 of the immediately preceding pronunciation period Tx_n-1, but the variables reflected in the prosody Py of the response voice Vy are not limited to the prosody change index Dx_n. For example, the prosody Py_n of the current response voice Vy may be set according to the prosody change index Dx_n and the prosody Py_n-1 of the immediately preceding response voice Vy. The prosody difference between plural past response voices Vy (Py_n-2 - Py_n-1) may also be applied, together with the prosody change index Dx_n, to the setting of the prosody Py_n of the response voice Vy.
(9) In each of the embodiments described above, the response signal Y is generated and played back from the audio signal Z stored in the storage device 22, but the response signal Y representing a response voice Vy of specific utterance content may also be synthesized by, for example, a known speech synthesis technique. For the synthesis of the response signal Y, for example, unit-concatenation speech synthesis, or speech synthesis using a statistical model such as a hidden Markov model, is suitably used. The utterance voice Vx and the response voice Vy are not limited to human vocal sounds; for example, animal calls may be used as the utterance voice Vx and the response voice Vy.
(10) In each of the embodiments described above, the voice dialogue apparatus 100 comprises the voice input device 24 and the playback device 26, but the voice input device 24 and the playback device 26 may instead be installed in a device separate from the voice dialogue apparatus 100 (a voice input/output device). The voice dialogue apparatus 100 is realized by, for example, a terminal device such as a mobile phone or a smartphone, and the voice input/output device is realized by, for example, an electronic device such as an animal-shaped toy or a robot. The voice dialogue apparatus 100 and the voice input/output device can communicate wirelessly or by wire: the utterance signal X generated by the voice input device 24 of the voice input/output device is transmitted wirelessly or by wire to the voice dialogue apparatus 100, and the response signal Y generated by the voice dialogue apparatus 100 is transmitted wirelessly or by wire to the playback device 26 of the voice input/output device.
(11) In each of the embodiments described above, the voice dialogue apparatus 100 is realized by an information processing device such as a mobile phone or a personal computer, but part or all of the functions of the voice dialogue apparatus 100 may also be realized by a server device (a so-called cloud server). Specifically, the voice dialogue apparatus 100 is realized by a server device that communicates with a terminal device via a communication network such as a mobile communication network or the Internet. For example, the voice dialogue apparatus 100 receives from the terminal device the utterance signal X generated by the voice input device 24 of that terminal device, generates the response signal Y from the utterance signal X by the configuration of each of the embodiments described above, transmits the response signal Y to the terminal device, and causes the playback device 26 of the terminal device to play back the response voice Vy. The voice dialogue apparatus 100 is realized by a single device or by a set of plural devices (that is, a server system), and how the functions realized by the voice dialogue apparatus 100 are divided between the server device and the terminal device is arbitrary.
(12) In each of the embodiments described above, a response voice Vy of specific utterance content (for example, a backchannel such as "uh-huh") is played back in response to the utterance voice Vx, but the utterance content of the response voice Vy is not limited to this example. For instance, the utterance content of the utterance voice Vx may be analyzed by speech recognition and morphological analysis of the utterance signal X, and a response voice Vy of content appropriate to that utterance content may be selected from plural candidates, or synthesized, and played back by the playback device 26. In a configuration that does not execute speech recognition and morphological analysis, a response voice Vy of utterance content prepared in advance, independent of the utterance voice Vx, is played back. On a simple view, one might suppose that no natural dialogue could be established this way; in practice, however, because the prosody of the response voice Vy is controlled in diverse ways as exemplified in the embodiments described above, the user U can perceive something like a natural dialogue between humans. A configuration that does not execute speech recognition and morphological analysis also has the advantage that the processing delay and processing load caused by those processes are reduced or eliminated.
(13) In each of the embodiments described above, the response signal Y of the response voice Vy is generated by adjusting the prosody Pz of the audio signal Z, but the method of generating the response signal Y is not limited to this example. For instance, a plurality of audio signals Z with mutually different prosodies Pz may be stored in the storage device 22, and the audio signal Z whose prosody Pz is closest to the prosody value corresponding to the prosody change index Dx (hereinafter "target value") may be selected from the plurality of audio signals Z as the response signal Y. That is, the process of selecting the response signal Y from plural candidates (audio signals Z) is one example of the process of generating the response signal Y. The response signal Y may also be generated from two or more audio signals Z selected from the plurality of audio signals Z in order of how close their prosodies Pz are to the target value; for example, the response signal Y is generated by a weighted sum of, or interpolation between, two or more audio signals Z.
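A sketch of this selection/blending variant, assuming the candidates are (Pz, signal) pairs; the inverse-distance weighting and the naive length alignment are assumptions, since the text does not fix how the weighted sum is formed:

    import numpy as np

    def select_response(candidates, target, blend=False):
        # candidates: list of (pz, signal) pairs; target: prosody value derived
        # from the change index Dx
        ranked = sorted(candidates, key=lambda c: abs(c[0] - target))
        if not blend or len(ranked) < 2:
            return ranked[0][1]                        # nearest candidate only
        (p1, z1), (p2, z2) = ranked[:2]
        d1, d2 = abs(p1 - target), abs(p2 - target)
        w = 0.5 if d1 + d2 == 0 else d2 / (d1 + d2)    # closer candidate weighs more
        n = min(len(z1), len(z2))                      # naive length alignment
        return w * np.asarray(z1[:n]) + (1 - w) * np.asarray(z2[:n])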
(14) The voice dialogue apparatus 100 exemplified in each of the embodiments described above can also be used for the evaluation of actual dialogue between humans. For example, the prosody of a response voice observed in an actual dialogue between humans (hereinafter "observed voice") is compared with the prosody of the response voice Vy generated in the manner described above; when the two prosodies are similar, the observed voice can be evaluated as appropriate, whereas when the prosodies diverge, the observed voice can be evaluated as inappropriate. A device executing the evaluation exemplified above (a dialogue evaluation device) may also be used for training dialogue between humans.
(15) The voice dialogue apparatus 100 exemplified in each of the embodiments described above is, as noted, realized by the cooperation of the control device 20 and a program for voice dialogue.
A program according to a first aspect of the present invention (for example, the first embodiment) causes a computer to execute voice analysis processing (Sa1) of specifying a feature quantity of a first voice represented by a first audio signal for each pronunciation period, and response generation processing (Sa2 and Sa3) of generating a second audio signal representing a second voice of a feature quantity corresponding to the change of the feature quantity of the first voice over a plurality of pronunciation periods. A program according to a second aspect of the present invention (for example, the second embodiment) causes a computer to execute voice analysis processing (Sb1) of specifying a feature quantity of a first voice represented by a first audio signal, and response generation processing (Sb2) of generating a second audio signal representing a second voice of a feature quantity corresponding to the feature quantity of the first voice.
The program according to each of the above aspects is provided in a form stored in a computer-readable recording medium and installed in a computer. The recording medium is, for example, a non-transitory recording medium, a good example of which is an optical recording medium (optical disc) such as a CD-ROM, but it may include any known form of recording medium such as a semiconductor recording medium or a magnetic recording medium. Note that "non-transitory recording medium" includes every computer-readable recording medium except a transitory, propagating signal, and does not exclude volatile recording media. The program may also be delivered to a computer in the form of distribution via a communication network.
(16) From the embodiments exemplified above, for example, the following configurations are grasped.
<Aspect 1>
A voice processing method according to a preferred aspect of the present invention (aspect 1) is realized by a computer specifying a feature quantity of a first voice represented by a first audio signal for each pronunciation period, and generating a second audio signal representing a second voice of a feature quantity corresponding to the change of the feature quantity of the first voice over a plurality of pronunciation periods. In this aspect, a second audio signal representing a second voice of a feature quantity corresponding to the change of the feature quantity of the first voice is generated. It is therefore possible to realize a natural voice dialogue simulating the tendency of real dialogue in which, for example, the feature quantity of the dialogue partner's response voice tracks changes in the feature quantity of the uttered voice.
<Aspect 2>
A voice processing apparatus according to a preferred aspect of the present invention (aspect 2) comprises a voice analysis unit that specifies a feature quantity of a first voice represented by a first audio signal for each pronunciation period, and a response generation unit that generates a second audio signal representing a second voice of a feature quantity corresponding to the change of the feature quantity of the first voice over a plurality of pronunciation periods. In this aspect, a second audio signal representing a second voice of a feature quantity corresponding to the change of the feature quantity of the first voice is generated. It is therefore possible to realize a natural voice dialogue simulating the tendency of real dialogue in which, for example, the feature quantity of the dialogue partner's response voice tracks changes in the feature quantity of the uttered voice.
<Other aspects>
In a preferred example of aspect 1 or aspect 2, the feature quantity of the first voice, and the feature quantity of the second voice corresponding to the change of that feature quantity, include at least one of: pitch, volume, speech rate, spectrum width (the amount of variation of the spectral envelope), the fluctuation range of pitch within a pronunciation period, the fluctuation range of volume within a pronunciation period, the interval between successive pronunciation periods, and the time length of a pronunciation period.
100: voice dialogue apparatus, 20: control device, 22: storage device, 24: voice input device, 242: sound collection device, 244: A/D converter, 26: playback device, 262: D/A converter, 264: sound emitting device, 32: voice acquisition unit, 34: voice analysis unit, 36: response generation unit.

Claims (10)

1.  A voice processing method realized by a computer, the method comprising:
     specifying a feature quantity of a first voice represented by a first audio signal for each pronunciation period; and
     generating a second audio signal representing a second voice of a feature quantity corresponding to a change of the feature quantity of the first voice over a plurality of pronunciation periods.
2.  The voice processing method according to claim 1, wherein the feature quantity of the second voice is a pitch.
3.  The voice processing method according to claim 1 or claim 2, wherein the feature quantity of the second voice is a volume.
4.  The voice processing method according to any one of claims 1 to 3, wherein the feature quantity of the second voice is a speech rate.
5.  The voice processing method according to any one of claims 1 to 4, wherein the feature quantity of the second voice is a spectrum width, which is an amount of variation of a spectral envelope.
6.  The voice processing method according to any one of claims 1 to 5, wherein the feature quantity of the second voice is a fluctuation range of pitch within a pronunciation period.
7.  The voice processing method according to any one of claims 1 to 6, wherein the feature quantity of the second voice is a fluctuation range of volume within a pronunciation period.
8.  The voice processing method according to any one of claims 1 to 7, wherein the feature quantity of the second voice is an interval between successive pronunciation periods.
9.  The voice processing method according to any one of claims 1 to 8, wherein the feature quantity of the second voice is a time length of a pronunciation period.
10.  A voice processing apparatus comprising:
     a voice analysis unit that specifies a feature quantity of a first voice represented by a first audio signal for each pronunciation period; and
     a response generation unit that generates a second audio signal representing a second voice of a feature quantity corresponding to a change of the feature quantity of the first voice over a plurality of pronunciation periods.
PCT/JP2018/034010 2017-09-25 2018-09-13 Speech processing method and speech processing device WO2019059094A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017-183546 2017-09-25
JP2017183546A JP2019060941A (en) 2017-09-25 2017-09-25 Voice processing method

Publications (1)

Publication Number Publication Date
WO2019059094A1 (en)

Family

ID=65810887

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/034010 WO2019059094A1 (en) 2017-09-25 2018-09-13 Speech processing method and speech processing device

Country Status (2)

Country Link
JP (1) JP2019060941A (en)
WO (1) WO2019059094A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0247700A (en) * 1988-08-10 1990-02-16 Nippon Hoso Kyokai <Nhk> Speech synthesizing method
JP2004086001A (en) * 2002-08-28 2004-03-18 Sony Corp Conversation processing system, conversation processing method, and computer program
US20050261905A1 (en) * 2004-05-21 2005-11-24 Samsung Electronics Co., Ltd. Method and apparatus for generating dialog prosody structure, and speech synthesis method and system employing the same
JP2006208460A (en) * 2005-01-25 2006-08-10 Honda Motor Co Ltd Equipment controller of voice recognition type and vehicle
JP2015069038A (en) * 2013-09-30 2015-04-13 ヤマハ株式会社 Voice synthesizer and program
JP2017106990A (en) * 2015-12-07 2017-06-15 ヤマハ株式会社 Voice interactive device and program


Also Published As

Publication number Publication date
JP2019060941A (en) 2019-04-18


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18858032

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18858032

Country of ref document: EP

Kind code of ref document: A1