
CN114360535B - Voice dialog generation method, device, electronic device and storage medium - Google Patents


Info

Publication number: CN114360535B
Application number: CN202111601277.6A
Authority: CN (China)
Prior art keywords: text, voice, input, feature, audio feature
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN114360535A
Inventors: 吴文权, 吴华
Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Events: application filed by Beijing Baidu Netcom Science and Technology Co Ltd; priority to CN202111601277.6A; publication of CN114360535A; application granted; publication of CN114360535B

Landscapes

  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure provides a voice dialog generation method and apparatus, an electronic device, and a storage medium, relating to the field of computer technology and in particular to artificial intelligence technologies such as speech technology, natural language processing, and computer vision. The specific implementation scheme is as follows: performing speech recognition on an acquired input speech to determine a first text corresponding to the input speech; performing audio feature extraction on the input speech to determine a first audio feature corresponding to the input speech; determining, according to the first audio feature and the first text, a second text and a second audio feature corresponding to a reply sentence to be generated; and generating a reply speech based on the second audio feature and the second text. Because the second text and the second audio feature are determined according to both the first audio feature and the first text of the input speech, the accuracy of the determined second text is improved, and the generated reply speech better matches the emotion of the speaker of the input speech.

Description

Voice Dialog Generation Method, Apparatus, Electronic Device and Storage Medium

Technical Field

The present disclosure relates to the field of computer technology, in particular to artificial intelligence technologies such as speech technology, natural language processing, and computer vision, and specifically to a voice dialog generation method and apparatus, an electronic device, and a storage medium.

Background

With the continuous development and refinement of artificial intelligence technology, it has come to play an extremely important role in many fields related to daily human life. For example, artificial intelligence has made remarkable progress in the field of spoken dialogue. In the related art, speech information can be converted into text, and semantic analysis can be performed on the text to determine a reply text. Because the related art determines the reply text based solely on a single feature, namely the text contained in the speech information, the accuracy of the final reply text may be low. How to improve the accuracy of reply sentences has therefore become a key research direction.

Summary

The present disclosure provides a voice dialog generation method and apparatus, an electronic device, and a storage medium.

According to a first aspect of the present disclosure, a voice dialog generation method is provided, including:

performing speech recognition on an acquired input speech to determine a first text corresponding to the input speech;

performing audio feature extraction on the input speech to determine a first audio feature corresponding to the input speech;

determining, according to the first audio feature and the first text, a second text and a second audio feature corresponding to a reply sentence to be generated; and

generating a reply speech based on the second audio feature and the second text.

According to a second aspect of the present disclosure, a voice dialog generation apparatus is provided, including:

a first determining module, configured to perform speech recognition on an acquired input speech to determine a first text corresponding to the input speech;

a second determining module, configured to perform audio feature extraction on the input speech to determine a first audio feature corresponding to the input speech;

a third determining module, configured to determine, according to the first audio feature and the first text, a second text and a second audio feature corresponding to a reply sentence to be generated; and

a generating module, configured to generate a reply speech based on the second audio feature and the second text.

According to a third aspect of the present disclosure, an electronic device is provided, including:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the voice dialog generation method described in the first aspect.

According to a fourth aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, where the computer instructions are used to cause a computer to perform the voice dialog generation method described in the first aspect.

According to a fifth aspect of the present disclosure, a computer program product is provided, including computer instructions which, when executed by a processor, implement the steps of the voice dialog generation method described in the first aspect.

The voice dialog generation method and apparatus, electronic device, and storage medium provided by the present disclosure have the following beneficial effects:

In the embodiments of the present disclosure, speech recognition is first performed on the acquired input speech to determine a first text corresponding to the input speech; audio feature extraction is then performed on the input speech to determine a first audio feature corresponding to the input speech; a second text and a second audio feature corresponding to a reply sentence to be generated are then determined according to the first audio feature and the first text; and finally a reply speech is generated based on the second audio feature and the second text. Because the second text and the second audio feature of the reply sentence are determined according to both the first audio feature and the first text of the input speech, the accuracy of the determined second text is improved, and the emotional characteristics of the reply sentence can be determined according to the emotional characteristics of the input speech, so that the generated reply speech better matches the emotion of the speaker of the input speech.

It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become easy to understand from the following description.

Brief Description of the Drawings

The accompanying drawings are used for a better understanding of the solution and do not constitute a limitation on the present disclosure. In the drawings:

Fig. 1 is a schematic flowchart of a voice dialog generation method according to an embodiment of the present disclosure;

Fig. 2 is a schematic flowchart of a voice dialog generation method according to another embodiment of the present disclosure;

Fig. 3 is a schematic flowchart of a voice dialog generation method according to yet another embodiment of the present disclosure;

Fig. 4 is a schematic structural diagram of a voice dialog generation apparatus according to an embodiment of the present disclosure;

Fig. 5 is a block diagram of an electronic device used to implement the voice dialog generation method according to an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the present disclosure are included to facilitate understanding; they should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.

The embodiments of the present disclosure relate to artificial intelligence technology fields such as computer vision and deep learning.

Artificial intelligence (AI) is a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence.

Key speech technologies in the computer field include automatic speech recognition (ASR) and text-to-speech synthesis (TTS). Enabling computers to hear, see, speak, and feel is the development direction of future human-computer interaction, and speech is regarded as the most promising mode of human-computer interaction, with more advantages than other interaction modes.

Natural language processing is the use of computers to process, understand, and apply human languages (such as Chinese and English). It is an interdisciplinary field between computer science and linguistics and is often called computational linguistics. Natural language is a fundamental characteristic distinguishing humans from other animals, and human thought is inseparable from language; natural language processing therefore embodies one of the highest goals of artificial intelligence. That is, only when a computer has the ability to process natural language can a machine be regarded as truly intelligent.

Computer vision refers to machine vision in which cameras and computers replace human eyes to identify, track, and measure targets, with further graphics processing so that the processed images are more suitable for human observation or for transmission to instruments for detection.

In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user personal information involved all comply with relevant laws and regulations and do not violate public order and good morals.

Fig. 1 is a schematic flowchart of a voice dialog generation method according to an embodiment of the present disclosure.

It should be noted that the voice dialog generation method of this embodiment is executed by a voice dialog generation apparatus. The apparatus may be implemented in software and/or hardware and may be configured in an electronic device, which may include, but is not limited to, a terminal, a server, and the like.

As shown in Fig. 1, the voice dialog generation method includes:

S101: performing speech recognition on the acquired input speech to determine a first text corresponding to the input speech.

The acquired input speech may be speech for which a corresponding reply needs to be generated according to the content it contains. The input speech may be a continuous segment of speech, such as a sentence or a passage.

Optionally, the input speech may be acquired by a speech capture device such as a microphone or a sound sensor, or may be read from a storage space in which speech is stored; this embodiment does not limit the manner of acquiring the input speech.

The first text refers to the text contained in the input speech, that is, the content of the input speech presented in text form.

In the embodiments of the present disclosure, speech recognition is used to convert the speech signal corresponding to the input speech into the corresponding first text. Optionally, a hidden Markov model (HMM) may be used to perform speech recognition on the input speech to determine the first text corresponding to the input speech. Alternatively, the acquired input speech may be compared against the speech in a speech database to find matching speech, and the utterance text corresponding to the matched speech in the database may be taken as the first text corresponding to the input speech. The present disclosure does not limit this.
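The database-comparison variant mentioned above can be sketched as follows. This is a minimal illustration only: the similarity measure (negative mean squared error over waveforms) and the database structure are assumptions for the sketch, not part of the disclosure.

```python
import numpy as np

def recognize_by_lookup(input_speech, speech_db):
    """Return the utterance text of the database entry most similar to
    the input speech. speech_db maps text -> waveform (1-D array).
    Negative MSE stands in for a real acoustic matching method."""
    best_text, best_score = None, -np.inf
    for text, waveform in speech_db.items():
        n = min(len(input_speech), len(waveform))
        score = -np.mean((input_speech[:n] - waveform[:n]) ** 2)
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```

In practice the comparison would operate on acoustic features rather than raw samples, but the control flow (scan the database, keep the best match, return its text) is the same.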

S102: performing audio feature extraction on the input speech to determine a first audio feature corresponding to the input speech.

The first audio feature may be information such as the frequency and amplitude of the speech signal corresponding to the input speech.

It should be noted that features such as the frequency and amplitude of the speech signal can reflect the emotional state of the speaker of the input speech. For example, a high frequency of the speech signal indicates that the speaker is speaking quickly and may be agitated, while a normal frequency indicates that the speaker may be relaxed. A high amplitude indicates that the speaker's voice is loud and the speaker's mood may be elevated, while a low amplitude indicates that the speaker's voice is quiet and the speaker's mood may be subdued.

Optionally, a fast Fourier transform may be used to extract audio features from the input speech to determine its frequency, amplitude, and so on. Alternatively, the max function in MATLAB may be used to extract the amplitude of the input speech, and the pitch function may be used to extract its frequency. The present disclosure does not limit this.
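As one illustration of the FFT-based option, the sketch below estimates a dominant frequency and a peak amplitude with numpy. The sampling rate is a parameter the caller must supply; the specific functions used here are an implementation choice, not prescribed by the disclosure.

```python
import numpy as np

def extract_audio_features(speech, sample_rate):
    """Estimate a dominant frequency (Hz) and a peak amplitude for the
    input speech using a fast Fourier transform."""
    spectrum = np.abs(np.fft.rfft(speech))          # magnitude spectrum
    freqs = np.fft.rfftfreq(len(speech), d=1.0 / sample_rate)
    dominant_freq = float(freqs[np.argmax(spectrum)])
    peak_amplitude = float(np.max(np.abs(speech)))  # time-domain peak
    return dominant_freq, peak_amplitude
```

For a pure 440 Hz tone sampled at 8 kHz, the function returns a dominant frequency of 440 Hz and a peak amplitude of 1.0.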

S103: determining, according to the first audio feature and the first text, a second text and a second audio feature corresponding to the reply sentence to be generated.

It should be noted that, in the embodiments of the present disclosure, when the reply sentence is determined according to the first audio feature of the input speech and the first text, not only the second text corresponding to the reply sentence can be determined, but the second audio feature of the reply sentence can also be determined at the same time; that is, the emotional tone with which the reply sentence is played can be determined.

The second text may be text generated according to the first audio feature and the first text, used to reply to the input speech.

The second audio feature may be an emotional characteristic, determined according to the first audio feature, with which the second text is played. For example, if the first audio feature indicates a high frequency and a high amplitude, meaning that the speaker of the input speech is agitated, the second audio feature of the reply sentence may be a moderate frequency and a moderate amplitude; that is, the second text is played in a relatively soothing tone.

Optionally, the first audio feature and the first text may be input into a preset dialogue model to obtain the second text and the second audio feature corresponding to the reply sentence to be generated.

Alternatively, keywords contained in the first text may first be extracted, and the second text and the second audio feature corresponding to the reply sentence to be generated may then be determined according to those keywords and the first audio feature.

In the embodiments of the present disclosure, the second text of the reply sentence is determined according to both the first audio feature and the first text of the input speech. Thus, even when the first text of two inputs is the same, different first audio features yield different second texts, which not only improves the accuracy of the reply sentence but also makes it better match the emotion of the speaker of the input speech and increases the diversity of reply texts.
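The example mapping described above (agitated speaker → soothing reply tone) can be sketched as a simple rule table. The category names, the tone labels, and the rules themselves are illustrative assumptions; the disclosure leaves the actual mapping to a dialogue model or keyword-based logic.

```python
def reply_audio_feature(freq_feature, amp_feature):
    """Map the input speech's frequency/amplitude categories to a tone
    for the reply speech. The rule table is illustrative only."""
    if freq_feature == "high" and amp_feature == "high":
        return "soothing"    # agitated speaker: calm the reply down
    if freq_feature == "low" and amp_feature == "low":
        return "comforting"  # subdued speaker: warm, encouraging reply
    return "neutral"         # relaxed speaker: plain delivery
```

A learned model could replace this table without changing the surrounding pipeline, since the output is just the second audio feature consumed by the synthesis step.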

S104: generating a reply speech based on the second audio feature and the second text.

The reply speech is the speech obtained by playing the second text with the second audio feature.

Optionally, text-to-speech (TTS) synthesis may be used to combine the second text and the second audio feature to generate the reply speech.

In the embodiments of the present disclosure, speech recognition is first performed on the acquired input speech to determine the first text corresponding to the input speech; audio feature extraction is then performed on the input speech to determine the first audio feature corresponding to the input speech; the second text and the second audio feature corresponding to the reply sentence to be generated are then determined according to the first audio feature and the first text; and finally the reply speech is generated based on the second audio feature and the second text. Because the second text and the second audio feature of the reply sentence are determined according to both the first audio feature and the first text of the input speech, the accuracy of the determined second text is improved, and the emotional characteristics of the reply sentence can be determined according to the emotional characteristics of the input speech, so that the generated reply speech better matches the emotion of the speaker of the input speech.
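The four steps S101 to S104 can be sketched as a small pipeline in which each stage is injected as a callable, so that any of the optional techniques named above (HMM or database lookup for S101, FFT or pitch detection for S102, a dialogue model for S103, a TTS engine for S104) can be plugged in. None of the function names below come from the disclosure; they are placeholders for this sketch.

```python
def generate_voice_dialog(input_speech, sample_rate,
                          recognize, extract, decide, synthesize):
    """Run S101-S104 as a pipeline over injected stage functions."""
    first_text = recognize(input_speech)                      # S101
    first_audio_feature = extract(input_speech, sample_rate)  # S102
    second_text, second_feature = decide(first_audio_feature,
                                         first_text)          # S103
    return synthesize(second_text, second_feature)            # S104
```

Keeping the stages separate mirrors the claim structure: each module of the second-aspect apparatus corresponds to one stage function.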

Fig. 2 is a schematic flowchart of a voice dialog generation method according to another embodiment of the present disclosure. As shown in Fig. 2, the voice dialog generation method includes:

S201: performing speech recognition on the acquired input speech to determine the first text corresponding to the input speech.

For the specific implementation of step S201, reference may be made to the detailed descriptions in the other embodiments of the present disclosure, which are not repeated here.

S202: performing audio feature extraction on the input speech to determine the first audio feature corresponding to the input speech.

The first audio feature may include an amplitude feature and a frequency feature.

Optionally, a second amplitude corresponding to the input speech may first be determined according to the first amplitude corresponding to each frame of the input speech, and the amplitude feature of the input speech may then be determined according to the range to which the second amplitude belongs.

The first amplitude may be the maximum of the amplitudes within a frame of speech.

The second amplitude may be the maximum of the first amplitudes of all frames; that is, the maximum amplitude of the input speech is taken as its second amplitude.

The amplitude feature may include high amplitude, medium amplitude, low amplitude, and so on, which is not limited in the present disclosure. It should be noted that each amplitude feature corresponds to a different amplitude range; in the embodiments of the present disclosure, the amplitude feature of the input speech may be determined according to the range to which the second amplitude belongs.

In the embodiments of the present disclosure, before the first amplitude of each frame of the input speech is obtained, the input speech may be divided into frames, that is, the audio data is split into short segments of fixed length.

Optionally, the first amplitude of each frame of the input speech may be determined in any suitable manner, which is not limited in the present disclosure. For example, a Fourier transform may be applied to each frame of the input speech to obtain the first amplitude of that frame.

In the embodiments of the present disclosure, the second amplitude of the input speech is first determined according to the first amplitude of each frame, and the input speech is then classified as high, medium, or low amplitude according to the range to which the second amplitude belongs. A single amplitude feature can thus represent the many amplitudes of the input audio, which reduces the computation required by subsequent processing without affecting the accuracy of the subsequently obtained second text.
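The framing-and-bucketing procedure above can be sketched as follows. The frame length and the low/high thresholds are illustrative assumptions; the disclosure specifies only that each amplitude feature corresponds to some amplitude range.

```python
import numpy as np

def amplitude_feature(speech, frame_len=256, thresholds=(0.3, 0.7)):
    """Split speech into fixed-length frames, take each frame's peak
    amplitude (the 'first amplitude'), take the maximum over all frames
    (the 'second amplitude'), and bucket it into low/medium/high."""
    n_frames = len(speech) // frame_len
    frames = speech[:n_frames * frame_len].reshape(n_frames, frame_len)
    first_amplitudes = np.max(np.abs(frames), axis=1)  # per-frame peaks
    second_amplitude = float(np.max(first_amplitudes))
    low, high = thresholds
    if second_amplitude < low:
        return "low"
    if second_amplitude < high:
        return "medium"
    return "high"
```

Collapsing all per-frame peaks into one categorical label is exactly the compression the paragraph describes: later stages see a single feature instead of one value per frame.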

Optionally, pitch detection may be performed on the input speech to determine the frequency value of the speech signal, and the frequency feature of the input speech may then be determined according to the range to which the frequency value belongs.

The frequency feature may include high frequency, medium frequency, low frequency, and so on, which is not limited in the present disclosure. It should be noted that each frequency feature corresponds to a different frequency range; in the embodiments of the present disclosure, the frequency feature of the input speech may be determined according to the range to which the frequency value belongs.

Optionally, pitch detection may be performed on the input speech to obtain the maximum frequency of the input speech, and that maximum frequency may be taken as the frequency value of the speech signal. Alternatively, the average frequency of the input speech may be taken as the frequency value. The present disclosure does not limit this.

In the embodiments of the present disclosure, pitch detection is first performed on the input speech to determine the frequency value of the speech signal, and the input speech is then classified as high, medium, or low frequency according to the range to which the frequency value belongs. A single frequency feature can thus represent the frequencies of the input audio, which reduces the computation required by subsequent processing without affecting the accuracy of the subsequently obtained second text.
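One common way to realize the pitch-detection step is autocorrelation; the sketch below uses it to estimate a pitch and then buckets the result. The autocorrelation method, the 500 Hz search ceiling, and the Hz thresholds are all illustrative choices, not specified by the disclosure.

```python
import numpy as np

def pitch_feature(speech, sample_rate, thresholds=(150.0, 300.0)):
    """Estimate a pitch via autocorrelation and bucket it into
    low/medium/high frequency. Thresholds (Hz) are illustrative."""
    x = speech - np.mean(speech)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..N-1
    min_lag = int(sample_rate / 500)    # ignore pitches above 500 Hz
    lag = min_lag + int(np.argmax(ac[min_lag:]))
    pitch_hz = sample_rate / lag
    low, high = thresholds
    if pitch_hz < low:
        return "low"
    if pitch_hz < high:
        return "medium"
    return "high"
```

For voiced speech a production system would add voicing detection and peak-picking refinements, but this suffices to show how a continuous frequency value collapses into the categorical feature used downstream.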

S203: determining, according to the first audio feature and the first text, a second text corresponding to the reply sentence to be generated and the emoticons contained in the second text.

The emoticons may include happy, surprised, afraid, worried, kind, and so on, which is not limited in the present disclosure.

It should be noted that the second text may contain one or more emoticons, which is not limited in the present disclosure. For example, the first sentence of the second text may correspond to one emoticon, and the second and third sentences may correspond to another.

For example, if the amplitude feature in the first audio feature is low and the frequency feature is also low, it may be determined that the speaker of the input speech is sad, and the emoticon contained in the generated second text may be "worried".

It should be noted that the above example is merely illustrative and does not specifically limit the first audio feature or the emoticons contained in the second text in the embodiments of the present disclosure.
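The worked example above (low amplitude + low frequency → sad speaker → "worried" emoticon) can be sketched as a rule, with the remaining branches filled in as illustrative assumptions:

```python
def choose_emoticon(amp_feature, freq_feature):
    """Pick an emoticon for the reply text from the input speech's
    amplitude/frequency categories. Only the low/low rule comes from
    the example in the text; the others are illustrative."""
    if amp_feature == "low" and freq_feature == "low":
        return "worried"   # speaker sounds sad
    if amp_feature == "high" and freq_feature == "high":
        return "kind"      # speaker sounds agitated; reply gently
    return "happy"
```

As with the tone mapping in S103, this table is a stand-in for whatever model the implementation actually uses.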

S204:在交互设备的显示屏幕上,显示第二文本及表情符号。S204: Display the second text and the emoticon on the display screen of the interactive device.

其中,交互设备为可以与用户实现交互的电子设备。交互设备可以通过接收用户的交互请求,并对交互请求进行处理,以生成交互请求对应的结果,进而通过语音、文本等形式向用户展示结果。Wherein, the interactive device is an electronic device capable of interacting with a user. The interaction device can receive the user's interaction request and process the interaction request to generate a result corresponding to the interaction request, and then display the result to the user in the form of voice or text.

本公开实施例中,在确定了输入语音对应答复语句的第二文本及第二文本中包含的表情符号之后,可以在交互设备的显示屏幕上,显示第二文本及表情符号,从而可以使用户结合显示界面中包含的表情符号,阅读第二文本,从而不仅答复了用户的请求,从而实现了多角度的与用户的有效沟通。In the embodiment of the present disclosure, after determining the second text and the emoticons contained in the reply sentence corresponding to the input voice, the second text and the emoticons can be displayed on the display screen of the interactive device, so that the user can Combining with the emoticons included in the display interface, reading the second text not only answers the user's request, but also realizes multi-angle effective communication with the user.

Optionally, in the embodiment of the present disclosure, in addition to displaying the second text and the emoticons on the display screen of the interactive device, the interactive device may also play the second text in a tone corresponding to the emoticons.

For example, if the emoticon contained in the second text is a worried expression, the interactive device can not only display the second text and the worried emoticon on the display screen, but also play the second text in a comforting tone.

In the embodiment of the present disclosure, speech recognition is first performed on the acquired input speech to determine the first text corresponding to the input speech; audio feature extraction is then performed on the input speech to determine the first audio feature corresponding to the input speech; the second text corresponding to the reply sentence to be generated and the emoticons contained in the second text are determined according to the first audio feature and the first text; and finally the second text and the emoticons are displayed on the display screen of the interactive device. Since the second text of the reply sentence and its emoticons are determined from both the first audio feature and the first text corresponding to the input speech, the determined second text is more accurate, and displaying the second text and the corresponding emoticons on the display screen of the interactive device achieves effective, multi-angle communication with the user.

Fig. 3 is a schematic flowchart of a method for generating a voice dialog according to yet another embodiment of the present disclosure. As shown in Fig. 3, the method for generating a voice dialog includes:

S301: Perform speech recognition on the acquired input speech to determine a first text corresponding to the input speech.

S302: Perform audio feature extraction on the input speech to determine a first audio feature corresponding to the input speech.

For the specific implementation of steps S301 and S302, reference may be made to the detailed descriptions in the other embodiments of the present disclosure, which are not repeated here.

S303: Determine, according to the first audio feature and the first text, a second text and a second audio feature corresponding to the reply sentence to be generated.

In the embodiment of the present disclosure, the first audio feature and the first text may be input into a preset dialogue model to obtain the second text and the second audio feature corresponding to the reply sentence to be generated.

Optionally, obtaining the preset dialogue model may include the following steps: acquiring a training sample set, where the training sample set contains input texts and their corresponding audio features, together with a labeled reply text and corresponding audio feature labels for each input text; inputting the input text and its corresponding audio features into an initial dialogue model to obtain a predicted reply text and corresponding predicted audio features output by the initial dialogue model; and correcting the initial dialogue model according to the difference between the predicted reply text and the labeled reply text and the difference between the predicted audio features and the audio feature labels, so as to generate the preset dialogue model.
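The described training step can be sketched as follows. The model here is a stand-in stub and the two "losses" are toy counting functions; in the disclosure, the dialogue model and its loss functions are learned components, so every name and value below is an assumption for illustration only:

```python
# Sketch of the described training flow: the model predicts a reply text
# plus audio features, and both the text difference and the feature
# difference contribute to the training signal. Stub model, toy losses.

def text_loss(predicted: str, labeled: str) -> int:
    # toy loss: count of differing characters (length-padded)
    n = max(len(predicted), len(labeled))
    return sum(1 for i in range(n)
               if (predicted[i:i+1] or " ") != (labeled[i:i+1] or " "))

def feature_loss(predicted: dict, labeled: dict) -> int:
    # toy loss: count of mismatched feature labels (frequency, amplitude, ...)
    return sum(1 for k in labeled if predicted.get(k) != labeled[k])

sample = {
    "input_text": "hello",
    "input_features": {"frequency": "low", "amplitude": "low"},
    "reply_label": "there, there",
    "feature_label": {"frequency": "mid", "amplitude": "low"},
}

def stub_model(text, features):
    # placeholder for the initial dialogue model
    return "there, there", {"frequency": "mid", "amplitude": "low"}

pred_text, pred_features = stub_model(sample["input_text"],
                                      sample["input_features"])
loss = text_loss(pred_text, sample["reply_label"]) \
     + feature_loss(pred_features, sample["feature_label"])
# in real training, `loss` would drive a gradient update of the model
```

The key structural point is that the combined signal depends on both the reply-text difference and the audio-feature difference, so the corrected model learns to produce both outputs jointly.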

Optionally, the training sample set may be obtained as follows: a large amount of text dialogue corpus is first mined automatically from network information, and the corpus is dubbed manually; audio feature extraction is then performed on the dubbed sample speech data to obtain the input texts contained in the text dialogue corpus and their corresponding audio features, together with the labeled reply texts and corresponding audio feature labels.

Here, the audio features may include frequency features and amplitude features. The amplitude features may include high amplitude, medium amplitude, and low amplitude; the frequency features may include high frequency, medium frequency, low frequency, and the like.

In the embodiment of the present disclosure, after audio feature analysis is performed on the dubbed sample speech data, the obtained frequencies and amplitudes may be sorted in descending order; frequencies within a first threshold range are then labeled as high frequency, frequencies within a second threshold range as medium frequency, and frequencies within a third threshold range as low frequency, while amplitudes within a fourth threshold range are labeled as high amplitude, amplitudes within a fifth threshold range as medium amplitude, and amplitudes within a sixth threshold range as low amplitude.

For example, if the frequency range corresponding to all the sample speech data is [a, b], the first threshold range may be [b-10%*(b-a), b], i.e., the highest 10% of the frequency range is labeled as high frequency; the second threshold range may be [a+10%*(b-a), b-10%*(b-a)], i.e., the 10%-90% band of the frequency range is labeled as medium frequency; and the third threshold range may be [a, a+10%*(b-a)], i.e., the lowest 10% of the frequency range is labeled as low frequency.

Similarly, if the amplitude range corresponding to all the sample speech data is [c, d], the fourth threshold range may be [d-10%*(d-c), d], i.e., the highest 10% of the amplitude range is labeled as high amplitude; the fifth threshold range may be [c+10%*(d-c), d-10%*(d-c)], i.e., the 10%-90% band of the amplitude range is labeled as medium amplitude; and the sixth threshold range may be [c, c+10%*(d-c)], i.e., the lowest 10% of the amplitude range is labeled as low amplitude.
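The three-way banding in the two examples above can be sketched as a single helper. The 10% cutoffs follow the example values in the text; the actual threshold ranges are unspecified design choices:

```python
# Sketch of the described banding: given the overall value range [lo, hi]
# for frequencies or amplitudes, the top 10% is labeled "high", the
# bottom 10% "low", and the middle 80% "mid".

def band_label(value: float, lo: float, hi: float) -> str:
    span = hi - lo
    if value >= hi - 0.10 * span:
        return "high"
    if value <= lo + 0.10 * span:
        return "low"
    return "mid"
```

The same function serves both features: call it with the frequency range [a, b] for frequency values and with the amplitude range [c, d] for amplitude values.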

It should be noted that the above examples are merely illustrative and do not constitute specific limitations on the first, second, third, fourth, fifth, or sixth threshold ranges in the embodiments of the present disclosure.

It can be understood that, since the preset dialogue model cannot learn frequencies or amplitudes of every possible value, in the embodiment of the present disclosure the frequencies and amplitudes can be divided into different levels by range, thereby improving the generalization ability of the dialogue model.

S304: Acquire a scene image corresponding to the input speech.

Here, the scene image may show the scene in which the speaker of the input speech is located, for example, a classroom, a restaurant, a playground, and the like. Optionally, the scene image may or may not contain a face image of the speaker corresponding to the input speech, which is not limited in the present disclosure.

Optionally, when it is detected that the collected speech data contains the user's voice, an image acquisition component is started to acquire the scene image corresponding to the input speech.

Here, the image acquisition component may be a component of the interactive device that has a photographing function, for example, the camera component included in a mobile phone or tablet device with interactive functions.

Alternatively, according to the acquisition time of the input speech, the scene image corresponding to the input speech is extracted from a captured video stream.

Optionally, a video capture device included in the interactive device captures a video stream in real time and stores the captured video stream in a memory; then, according to the acquisition time of the input speech, the scene image corresponding to the input speech is extracted from the captured video stream.
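Selecting the frame that corresponds to the acquisition time of the input speech can be sketched as a nearest-timestamp lookup over the buffered video stream. The frame representation here is a simplified (timestamp, data) pair and is an assumption for illustration:

```python
# Sketch: pick the buffered frame whose timestamp is closest to the
# moment the input speech was captured. In a real system the frames
# would come from the device's video buffer.

def frame_for_speech(frames, speech_time: float):
    """Return the buffered frame closest in time to `speech_time`."""
    if not frames:
        return None
    return min(frames, key=lambda f: abs(f[0] - speech_time))

frames = [(0.0, "f0"), (0.5, "f1"), (1.0, "f2"), (1.5, "f3")]
nearest = frame_for_speech(frames, 1.1)  # speech captured at t = 1.1 s
```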

In the embodiment of the present disclosure, the scene image corresponding to the input speech is acquired when it is detected that the collected speech data contains the user's voice, or is extracted from the captured video stream according to the acquisition time of the input speech, so that the acquired scene image can accurately reflect the scene in which the speaker of the input speech is located.

S305: Perform visual feature extraction on the scene image to determine visual features corresponding to the scene image.

Here, the visual features may be scene features contained in the scene image, for example, a classroom, a restaurant, a basketball court, and the like.

Optionally, object detection may first be performed on the scene image to obtain the category and position information of the objects contained in the scene image, and the scene described by the scene image, i.e., the visual feature corresponding to the scene image, is then determined according to the category and position information of each object.

Alternatively, the scene image may be automatically segmented to separate the objects or color regions it contains; features are then extracted from each image sub-block and indexed to obtain the spatial relationship features of each object in the scene image, and the scene described by the scene image is determined based on the category and spatial relationship of each object.
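Once object categories have been detected, mapping them to a coarse scene label can be sketched with simple rules. The object and scene names below are hypothetical; a production system would likely use a trained scene classifier rather than hand-written rules:

```python
# Illustrative sketch: map the set of detected object categories in a
# scene image to a coarse scene label via simple containment rules.

SCENE_RULES = [
    ({"desk", "blackboard"}, "classroom"),
    ({"table", "plate"}, "restaurant"),
    ({"hoop", "court"}, "basketball court"),
]

def scene_from_objects(detected: set) -> str:
    """Return the first scene whose required objects are all detected."""
    for required, scene in SCENE_RULES:
        if required <= detected:
            return scene
    return "unknown"
```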

S306: Correct the second text and/or the second audio feature according to the visual features.

It can be understood that, after the visual features are determined, the second text or the second audio feature can be corrected according to the visual features, so that the corrected second text or second audio feature is more accurate and better matches the emotion of the speaker corresponding to the input speech.

S307: Generate a reply speech based on the corrected second audio feature and the second text.

For the specific implementation of step S307, reference may be made to the detailed descriptions in the other embodiments of the present disclosure, which are not repeated here.

In the embodiment of the present disclosure, speech recognition is first performed on the acquired input speech to determine the first text corresponding to the input speech; audio feature extraction is then performed on the input speech to determine the first audio feature corresponding to the input speech; the second text and the second audio feature corresponding to the reply sentence to be generated are determined according to the first audio feature and the first text; the second text and the second audio feature are then corrected according to the visual features corresponding to the scene image; and finally the reply speech is generated based on the corrected second audio feature and the second text. Correcting the second text and the second audio feature, generated from the first audio feature and the first text, on the basis of the visual features corresponding to the scene image further improves their accuracy, and thus further improves the accuracy of the generated reply speech, so that the reply speech better matches the emotion of the speaker corresponding to the input speech.

Fig. 4 is a schematic structural diagram of an apparatus for generating a voice dialog according to yet another embodiment of the present disclosure. As shown in Fig. 4, the apparatus 400 for generating a voice dialog includes: a first determining module 410, a second determining module 420, a third determining module 430, and a generating module 440.

The first determining module 410 is configured to perform speech recognition on the acquired input speech to determine a first text corresponding to the input speech.

The second determining module 420 is configured to perform audio feature extraction on the input speech to determine a first audio feature corresponding to the input speech.

The third determining module 430 is configured to determine, according to the first audio feature and the first text, a second text and a second audio feature corresponding to the reply sentence to be generated.

The generating module 440 is configured to generate a reply speech based on the second audio feature and the second text.

Optionally, the second determining module 420 is specifically configured to:

determine a second amplitude corresponding to the input speech according to the first amplitude corresponding to each speech frame of the input speech; and

determine an amplitude feature corresponding to the input speech according to the range to which the second amplitude belongs.

Optionally, the second determining module 420 is further specifically configured to:

perform pitch detection on the input speech to determine a frequency value corresponding to the speech signal; and

determine a frequency feature corresponding to the input speech according to the range to which the frequency value belongs.
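The two feature computations handled by the second determining module can be sketched with textbook methods: a mean over per-frame amplitudes for the second amplitude, and an autocorrelation peak pick for pitch detection. These are standard techniques assumed for illustration, not necessarily the ones used in the disclosure:

```python
import math

# Sketch of the per-utterance features described above: a frame-level
# amplitude summary and an autocorrelation-based pitch estimate.

def amplitude_feature(frame_amplitudes):
    """Second amplitude = mean of the per-frame (first) amplitudes."""
    return sum(frame_amplitudes) / len(frame_amplitudes)

def pitch_hz(samples, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate the fundamental frequency by picking the lag with the
    highest autocorrelation inside the plausible pitch range."""
    lo = int(sample_rate / f_max)
    hi = int(sample_rate / f_min)
    best_lag, best_corr = lo, float("-inf")
    for lag in range(lo, min(hi, len(samples) - 1)):
        corr = sum(samples[i] * samples[i + lag]
                   for i in range(len(samples) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag

# 200 Hz sine sampled at 8 kHz as a test signal
sr = 8000
signal = [math.sin(2 * math.pi * 200 * n / sr) for n in range(800)]
f0 = pitch_hz(signal, sr)
```

For the 200 Hz test tone, the autocorrelation peaks at a lag of 40 samples (8000 / 200), so the estimate recovers the true pitch; the resulting frequency value would then be banded into high, medium, or low as described above.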

Optionally, the generating module 440 includes:

a first acquiring unit, configured to acquire a scene image corresponding to the input speech;

a first determining unit, configured to perform visual feature extraction on the scene image to determine visual features corresponding to the scene image;

a correcting unit, configured to correct the second text and/or the second audio feature according to the visual features; and

a generating unit, configured to generate a reply speech based on the corrected second audio feature and the second text.

Optionally, the first acquiring unit is specifically configured to:

start an image acquisition component to acquire the scene image corresponding to the input speech in response to detecting that the collected speech data contains the user's voice; or

extract, according to the acquisition time of the input speech, the scene image corresponding to the input speech from a captured video stream.

Optionally, the apparatus further includes:

a fourth determining module, configured to determine, according to the first audio feature and the first text, a second text corresponding to the reply sentence to be generated and emoticons contained in the second text; and

a display module, configured to display the second text and the emoticons on a display screen of an interactive device.

It should be noted that the foregoing explanations of the method for generating a voice dialog also apply to the apparatus for generating a voice dialog of this embodiment, and are not repeated here.

In the embodiment of the present disclosure, speech recognition is first performed on the acquired input speech to determine the first text corresponding to the input speech; audio feature extraction is then performed on the input speech to determine the first audio feature corresponding to the input speech; the second text and the second audio feature corresponding to the reply sentence to be generated are determined according to the first audio feature and the first text; and finally the reply speech is generated based on the second audio feature and the second text. Since the second text and the second audio feature of the reply sentence are determined from both the first audio feature and the first text corresponding to the input speech, the accuracy of the determined second text is improved, and the emotional features of the reply sentence can be determined from the emotional features of the input speech, so that the generated reply speech better matches the emotion of the speaker corresponding to the input speech.

According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.

Fig. 5 shows a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementations of the present disclosure described and/or claimed herein.

As shown in Fig. 5, the device 500 includes a computing unit 501 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 502 or loaded from a storage unit 508 into a random-access memory (RAM) 503. The RAM 503 can also store the various programs and data required for the operation of the device 500. The computing unit 501, the ROM 502, and the RAM 503 are connected to one another through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506, such as a keyboard or a mouse; an output unit 507, such as various types of displays or speakers; a storage unit 508, such as a magnetic disk or an optical disc; and a communication unit 509, such as a network card, a modem, or a wireless communication transceiver. The communication unit 509 allows the device 500 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.

The computing unit 501 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, or the like. The computing unit 501 performs the methods and processes described above, such as the generation of a voice dialog. For example, in some embodiments, the generation of a voice dialog may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the above-described method for generating a voice dialog can be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the generation of a voice dialog in any other suitable manner (for example, by means of firmware).

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device (for example, a CRT (cathode-ray tube) or LCD (liquid-crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, speech input, or tactile input).

The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, as a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), the Internet, and blockchain networks.

A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that overcomes the drawbacks of difficult management and weak business scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.

In this embodiment, speech recognition is first performed on the acquired input speech to determine the first text corresponding to the input speech; audio feature extraction is then performed on the input speech to determine the first audio feature corresponding to the input speech; the second text and the second audio feature corresponding to the reply sentence to be generated are determined according to the first audio feature and the first text; and finally the reply speech is generated based on the second audio feature and the second text. Since the second text and the second audio feature of the reply sentence are determined from both the first audio feature and the first text corresponding to the input speech, the accuracy of the determined second text is improved, and the emotional features of the reply sentence can be determined from the emotional features of the input speech, so that the generated reply speech better matches the emotion of the speaker corresponding to the input speech.

It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed herein can be achieved; no limitation is imposed here.

In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be interpreted as indicating or implying relative importance or implicitly specifying the quantity of the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present disclosure, "plurality" means at least two, for example two or three, unless otherwise specifically defined. In the description of the present disclosure, the word "if" may be interpreted as "when", "upon", "in response to determining", or "in the case that".

The specific implementations described above do not limit the protection scope of the present disclosure. It should be apparent to those skilled in the art that various modifications, combinations, sub-combinations, and substitutions may be made depending on design requirements and other factors. Any modifications, equivalent replacements, and improvements made within the spirit and principles of the present disclosure shall fall within its protection scope.

Claims (13)

1. A method for generating a speech dialogue, comprising:
performing speech recognition on acquired input speech to determine a first text corresponding to the input speech;
performing audio feature extraction on the input speech to determine a first audio feature corresponding to the input speech;
determining, according to the first audio feature and the first text, a second text and a second audio feature corresponding to a reply sentence to be generated, wherein the first audio feature and the first text are input into a preset dialogue model to obtain the second text and the second audio feature corresponding to the reply sentence to be generated; and
generating a reply speech based on the second audio feature and the second text;
wherein the preset dialogue model is generated by: obtaining a training sample set, wherein the training sample set includes input texts with corresponding audio features, and annotated reply texts with corresponding audio feature labels for the input texts; inputting the input texts and the corresponding audio features into an initial dialogue model to obtain predicted reply texts and corresponding predicted audio features output by the initial dialogue model; and correcting the initial dialogue model according to differences between the predicted reply texts and the annotated reply texts and between the predicted audio features and the audio feature labels, so as to generate the preset dialogue model, wherein the audio features include a frequency feature and an amplitude feature;
wherein generating the reply speech based on the second audio feature and the second text comprises:
acquiring a scene image corresponding to the input speech, the scene image including the scene where the speaker of the input speech is located;
performing visual feature extraction on the scene image to determine a visual feature corresponding to the scene image, wherein the visual feature is a scene feature contained in the scene image;
correcting the second text and/or the second audio feature according to the visual feature; and
generating the reply speech based on the corrected second audio feature and the second text;
wherein determining the visual feature corresponding to the scene image comprises:
automatically segmenting the scene image to divide out the objects or color regions contained in the scene image, extracting features for each image sub-block and building an index so as to obtain a spatial-relationship feature corresponding to each object in the scene image, and determining the visual feature corresponding to the scene image based on the type and spatial relationship of each object.

2. The method according to claim 1, wherein performing audio feature extraction on the input speech to determine the first audio feature corresponding to the input speech comprises:
determining a second amplitude corresponding to the input speech according to a first amplitude corresponding to each frame of the input speech; and
determining an amplitude feature corresponding to the input speech according to the range to which the second amplitude belongs.

3. The method according to claim 2, wherein performing audio feature extraction on the input speech to determine the first audio feature corresponding to the input speech comprises:
performing pitch detection on the input speech to determine a frequency value corresponding to the speech signal; and
determining a frequency feature corresponding to the input speech according to the range to which the frequency value belongs.

4. The method according to claim 1, wherein acquiring the scene image corresponding to the input speech comprises:
in response to detecting that collected speech data contains user speech, starting an image acquisition component to acquire the scene image corresponding to the input speech; or
intercepting, according to the acquisition time of the input speech, the scene image corresponding to the input speech from a collected video stream.

5. The method according to any one of claims 1-3, further comprising, after determining the first audio feature corresponding to the input speech:
determining, according to the first audio feature and the first text, the second text corresponding to the reply sentence to be generated and emoticons contained in the second text; and
displaying the second text and the emoticons on a display screen of an interactive device.

6. An apparatus for generating a speech dialogue, comprising:
a first determining module configured to perform speech recognition on acquired input speech to determine a first text corresponding to the input speech;
a second determining module configured to perform audio feature extraction on the input speech to determine a first audio feature corresponding to the input speech;
a third determining module configured to determine, according to the first audio feature and the first text, a second text and a second audio feature corresponding to a reply sentence to be generated, wherein the first audio feature and the first text are input into a preset dialogue model to obtain the second text and the second audio feature corresponding to the reply sentence to be generated; and
a generating module configured to generate a reply speech based on the second audio feature and the second text;
wherein the preset dialogue model is generated by: obtaining a training sample set, wherein the training sample set includes input texts with corresponding audio features, and annotated reply texts with corresponding audio feature labels for the input texts; inputting the input texts and the corresponding audio features into an initial dialogue model to obtain predicted reply texts and corresponding predicted audio features output by the initial dialogue model; and correcting the initial dialogue model according to differences between the predicted reply texts and the annotated reply texts and between the predicted audio features and the audio feature labels, so as to generate the preset dialogue model, wherein the audio features include a frequency feature and an amplitude feature;
wherein the generating module comprises:
a first acquiring unit configured to acquire a scene image corresponding to the input speech, the scene image including the scene where the speaker of the input speech is located;
a first determining unit configured to perform visual feature extraction on the scene image to determine a visual feature corresponding to the scene image, wherein the visual feature is a scene feature contained in the scene image;
a correcting unit configured to correct the second text and/or the second audio feature according to the visual feature; and
a generating unit configured to generate the reply speech based on the corrected second audio feature and the second text;
wherein determining the visual feature corresponding to the scene image comprises:
automatically segmenting the scene image to divide out the objects or color regions contained in the scene image, extracting features for each image sub-block and building an index so as to obtain a spatial-relationship feature corresponding to each object in the scene image, and determining the visual feature corresponding to the scene image based on the type and spatial relationship of each object.

7. The apparatus according to claim 6, wherein the second determining module is specifically configured to:
determine a second amplitude corresponding to the input speech according to a first amplitude corresponding to each frame of the input speech; and
determine an amplitude feature corresponding to the input speech according to the range to which the second amplitude belongs.

8. The apparatus according to claim 7, wherein the second determining module is specifically configured to:
perform pitch detection on the input speech to determine a frequency value corresponding to the speech signal; and
determine a frequency feature corresponding to the input speech according to the range to which the frequency value belongs.

9. The apparatus according to claim 6, wherein the first acquiring unit is specifically configured to:
in response to detecting that collected speech data contains user speech, start an image acquisition component to acquire the scene image corresponding to the input speech; or
intercept, according to the acquisition time of the input speech, the scene image corresponding to the input speech from a collected video stream.

10. The apparatus according to any one of claims 6-8, further comprising:
a fourth determining module configured to determine, according to the first audio feature and the first text, the second text corresponding to the reply sentence to be generated and emoticons contained in the second text; and
a display module configured to display the second text and the emoticons on a display screen of an interactive device.

11. An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-5.

12. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-5.

13. A computer program product comprising computer instructions which, when executed by a processor, implement the steps of the method according to any one of claims 1-5.
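Claims 2 and 3 derive the amplitude feature and the frequency feature by mapping a per-utterance amplitude value and a detected pitch value into ranges. The sketch below is one possible reading, not the patented implementation: the range thresholds, the use of the mean of the per-frame amplitudes as the "second amplitude", and autocorrelation as the pitch detector are all assumptions that the claims leave open.

```python
# Hypothetical range thresholds; the claims only state that the feature is
# determined by the range a value falls into, not what the ranges are.
AMPLITUDE_RANGES = [(0.0, 0.3, "low"), (0.3, 0.7, "medium"),
                    (0.7, float("inf"), "high")]
FREQUENCY_RANGES = [(0.0, 150.0, "low"), (150.0, 300.0, "medium"),
                    (300.0, float("inf"), "high")]

def bucket(value, ranges):
    """Map a value to the label of the range it falls into."""
    for lo, hi, label in ranges:
        if lo <= value < hi:
            return label
    raise ValueError(f"value {value} outside all ranges")

def amplitude_feature(frame_amplitudes):
    """Claim 2: derive a second amplitude from the per-frame first
    amplitudes (here their mean, one plausible choice), then bucket it."""
    second_amplitude = sum(frame_amplitudes) / len(frame_amplitudes)
    return bucket(second_amplitude, AMPLITUDE_RANGES)

def frequency_feature(frame, sample_rate=16000):
    """Claim 3: pitch detection (here by brute-force autocorrelation over
    the 50-500 Hz band, one common method), then bucket the frequency."""
    n = len(frame)
    best_lag, best_corr = 0, 0.0
    for lag in range(sample_rate // 500, sample_rate // 50):
        corr = sum(frame[i] * frame[i - lag] for i in range(lag, n))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    f0 = sample_rate / best_lag if best_lag else 0.0
    return bucket(f0, FREQUENCY_RANGES)
```

For example, a 200 Hz sine sampled at 16 kHz yields a best autocorrelation lag of 80 samples, so `frequency_feature` buckets it as "medium" under the thresholds assumed above; a production system would use a library pitch tracker rather than this brute-force loop.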
CN202111601277.6A 2021-12-24 2021-12-24 Voice dialog generation method, device, electronic device and storage medium Active CN114360535B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111601277.6A CN114360535B (en) 2021-12-24 2021-12-24 Voice dialog generation method, device, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN114360535A CN114360535A (en) 2022-04-15
CN114360535B true CN114360535B (en) 2023-01-31

Family

ID=81102248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111601277.6A Active CN114360535B (en) 2021-12-24 2021-12-24 Voice dialog generation method, device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114360535B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114792521A (en) * 2022-04-14 2022-07-26 杭州艾力特数字科技有限公司 Intelligent reply method and device based on speech recognition
CN118945562B (en) * 2024-07-22 2025-12-16 北京字跳网络技术有限公司 Audio input method, electronic device, storage medium, and computer program product

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109215679A (en) * 2018-08-06 2019-01-15 百度在线网络技术(北京)有限公司 Dialogue method and device based on user emotion
CN112286366A (en) * 2020-12-30 2021-01-29 北京百度网讯科技有限公司 Method, apparatus, device and medium for human-computer interaction
CN112434139A (en) * 2020-10-23 2021-03-02 北京百度网讯科技有限公司 Information interaction method and device, electronic equipment and storage medium
CN112528004A (en) * 2020-12-24 2021-03-19 北京百度网讯科技有限公司 Voice interaction method, voice interaction device, electronic equipment, medium and computer program product
CN112786047A (en) * 2021-01-28 2021-05-11 百度在线网络技术(北京)有限公司 Voice processing method, device, equipment, storage medium and intelligent sound box
CN113434647A (en) * 2021-06-18 2021-09-24 竹间智能科技(上海)有限公司 Man-machine interaction method, system and storage medium

Also Published As

Publication number Publication date
CN114360535A (en) 2022-04-15

Similar Documents

Publication Publication Date Title
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
CN114416934B (en) Multimodal dialog generation model training method, device and electronic equipment
CN110808034A (en) Voice conversion method, device, storage medium and electronic equipment
CN112910761B (en) Instant messaging method, device, equipment, storage medium and program product
CN113851106B (en) Audio playback method, device, electronic device and readable storage medium
CN113674746B (en) Man-machine interaction method, device, equipment and storage medium
CN114895817B (en) Interaction information processing method, network model training method and device
TWI509432B (en) Electronic device and language analysis method thereof
CN107808007A (en) Information processing method and device
CN112669837B (en) Awakening method and device of intelligent terminal and electronic equipment
CN114663556A (en) Data interaction method, device, equipment, storage medium and program product
CN116680376A (en) Information interaction method, device, electronic device and storage medium
CN117198289B (en) Voice interaction method, device, equipment, medium and product
CN114360535B (en) Voice dialog generation method, device, electronic device and storage medium
CN117593608A (en) Training method, device, equipment and storage medium for graphic recognition large model
CN114882151A (en) Method and device for generating virtual image video, equipment, medium and product
CN117114008B (en) Semantic action matching method device, equipment and storage medium for virtual image
CN114758649B (en) A speech recognition method, device, equipment and medium
CN112951274A (en) Voice similarity determination method and device, and program product
CN112860995A (en) Interaction method, device, client, server and storage medium
CN116011542A (en) Intelligent questionnaire interview model training method, intelligent questionnaire interview method and device
CN113689867B (en) A training method, device, electronic device and medium for a speech conversion model
CN115579105A (en) Language rehabilitation training method and device, electronic equipment and storage medium
US20240331681A1 (en) Automatic adaptation of the synthesized speech output of a translation application
CN113742517B (en) Method, device, electronic device and storage medium for generating voice packet

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant