CN107910004A

CN107910004A - Speech translation processing method and device

Info

Publication number: CN107910004A
Application number: CN201711107221.9A
Authority: CN
Inventors: 刘俊华; 魏思; 胡国平; 柳林; 王建社; 方昕; 李永超; 孟廷
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2017-11-10
Filing date: 2017-11-10
Publication date: 2018-04-13

Abstract

The embodiment of the invention provides a speech translation processing method and device, and belongs to the technical field of language processing. The method comprises the following steps: in the process of broadcasting the first synthesized voice signal, if a mixed voice signal containing part of the first synthesized voice signal is received, the broadcasting of the first synthesized voice signal is stopped. And filtering part of the first synthesized voice signals from the mixed voice signals to obtain the voice signals to be translated in the current round, and using the voice signals as target voice signals. And acquiring a second synthesized voice signal based on the target voice signal, and broadcasting the second synthesized voice signal. When a mixed voice signal containing part of the first synthesized voice signal is received, the broadcasting of the first synthesized voice signal is stopped, and the broadcasting of the second synthesized voice signal is also stopped. Because any party in the communication process can interrupt the broadcasting process at any time according to the full-duplex mode, the broadcasting process does not need to be waited to finish each time, and the communication efficiency can be improved.

Description

Speech translation processing method and device

技术领域technical field

本发明实施例涉及语言处理技术领域，更具体地，涉及一种语音翻译处理方法及装置。Embodiments of the present invention relate to the technical field of language processing, and more specifically, to a speech translation processing method and device.

背景技术Background technique

目前，语言沟通成为不同种族群体在相互交流时所面临的一个重要课题。例如，在双人或多人会议中，可通过自动语音翻译系统实现语音翻译。其中，自动语音翻译系统通常由语音识别、机器翻译和语音合成三部分组成。源语种语音信号通过语音识别得到源语种文本数据，然后通过机器翻译将源语种文本数据翻译成目标语种文本数据，最后通过对目标语种文本数据进行语音合成，得到目标语种语音信号并进行播报。在进行语音翻译时，需要等到上一轮次的目标语种语音信号播报完后，才可以翻译下一轮次的源语种语音信号。At present, language communication has become an important issue faced by different ethnic groups when they communicate with each other. For example, in a two-person or multi-person meeting, voice translation can be realized through an automatic voice translation system. Among them, the automatic speech translation system usually consists of three parts: speech recognition, machine translation and speech synthesis. The source language voice signal is obtained through speech recognition to obtain the source language text data, and then the source language text data is translated into the target language text data through machine translation, and finally the target language text data is synthesized by speech to obtain the target language voice signal and broadcast. When performing voice translation, it is necessary to wait until the last round of target language voice signals are broadcast before translating the next round of source language voice signals.

由于在即时沟通交流中，上一轮次的播报过程在进行一半时收听方可能就能够明白讲话方所要讲的话的含义，而收听方需要等待上一轮次的播报过程结束后才能继续下一轮次的沟通，从而沟通效率较低。Because in real-time communication, the listener may be able to understand the meaning of what the speaker wants to say when the last round of broadcast process is halfway through, and the listener needs to wait for the end of the last round of broadcast process before proceeding to the next round. Rounds of communication, resulting in low communication efficiency.

发明内容Contents of the invention

为了解决上述问题，本发明实施例提供一种克服上述问题或者至少部分地解决上述问题的语音翻译处理方法及装置。In order to solve the above problems, an embodiment of the present invention provides a voice translation processing method and device for overcoming the above problems or at least partially solving the above problems.

根据本发明实施例的第一方面，提供了一种语音翻译处理方法，该方法包括：According to a first aspect of an embodiment of the present invention, a speech translation processing method is provided, the method comprising:

在对第一合成语音信号进行播报的过程中，若接收到包含有部分第一合成语音信号的混合语音信号，则停止播报第一合成语音信号，第一合成语音信号为经由上一轮次翻译以及语音合成后所得到的；In the process of broadcasting the first synthesized speech signal, if a mixed speech signal containing part of the first synthesized speech signal is received, the broadcast of the first synthesized speech signal is stopped, and the first synthesized speech signal is translated through the last round and what is obtained after speech synthesis;

从混合语音信号过滤掉部分第一合成语音信号，得到当前轮次待翻译的语音信号，并作为目标语音信号；Filter out part of the first synthesized speech signal from the mixed speech signal, obtain the speech signal to be translated in the current round, and use it as the target speech signal;

基于目标语音信号，获取第二合成语音信号，并播报第二合成语音信号，第二合成语音信号是对目标语音信号进行翻译以及语音合成后所得到的。Based on the target voice signal, a second synthesized voice signal is obtained, and the second synthesized voice signal is broadcast. The second synthesized voice signal is obtained by translating and voice-synthesizing the target voice signal.

本发明实施例提供的方法，通过在对第一合成语音信号进行播报的过程中，若接收到包含有部分第一合成语音信号的混合语音信号，则停止播报第一合成语音信号。从混合语音信号过滤掉部分第一合成语音信号，得到当前轮次待翻译的语音信号，并作为目标语音信号。基于目标语音信号，获取第二合成语音信号，并播报第二合成语音信号。由于沟通过程中的任意一方，均可按照全双工模式随时打断播报过程，而不用每次都等到一轮播报过程结束，从而在提高沟通效率的同时，还可使得不同语种用户之间沟通更加自然流畅。In the method provided by the embodiment of the present invention, in the process of broadcasting the first synthesized voice signal, if a mixed voice signal including part of the first synthesized voice signal is received, the broadcast of the first synthesized voice signal is stopped. Part of the first synthesized speech signal is filtered out from the mixed speech signal to obtain the speech signal to be translated in the current round as the target speech signal. Obtain a second synthesized voice signal based on the target voice signal, and broadcast the second synthesized voice signal. Since any party in the communication process can interrupt the broadcast process at any time according to the full-duplex mode, instead of waiting for a round of broadcast process to end every time, it can improve communication efficiency and enable users of different languages to communicate More natural flow.

结合第一方面的第一种可能的实现方式，在第二种可能的实现方式中，第一合成语音信号与目标语音信号为相同的语种类型或者第一合成语音信号与目标语音信号为不同的语种类型。With reference to the first possible implementation of the first aspect, in the second possible implementation, the first synthesized speech signal and the target speech signal are of the same language type or the first synthesized speech signal and the target speech signal are different Language type.

结合第一方面的第一种可能的实现方式，在第三种可能的实现方式中，播报第二合成语音信号之前，还包括：In combination with the first possible implementation of the first aspect, in a third possible implementation, before broadcasting the second synthesized voice signal, the method further includes:

获取对目标语音信号进行语音识别后得到的识别文本数据，获取对识别文本数据进行翻译后得到的目标文本数据，并对目标文本数据进行语音合成，得到第二合成语音信号。Acquiring recognized text data obtained after performing speech recognition on the target speech signal, obtaining target text data obtained after translating the recognized text data, and performing speech synthesis on the target text data to obtain a second synthesized speech signal.

结合第一方面的第三种可能的实现方式，在第四种可能的实现方式中，获取对识别文本数据进行翻译后得到的目标文本数据，包括：In combination with the third possible implementation of the first aspect, in a fourth possible implementation, the target text data obtained after translating the recognized text data includes:

确定识别文本数据对应的源语种类型，并按照预设对应关系确定源语种类型对应的目标语种类型；Determine the source language type corresponding to the recognized text data, and determine the target language type corresponding to the source language type according to the preset correspondence relationship;

将目标语种类型及识别文本数据输入至翻译编解码模型，输出目标文本数据。Input the target language type and recognized text data into the translation codec model, and output the target text data.

结合第一方面的第四种可能的实现方式，在第五种可能的实现方式中，基于目标语音信号中的声纹特征，确定声纹特征对应的预设语种类型，以作为源语种类型对应的目标语种类型。In combination with the fourth possible implementation of the first aspect, in the fifth possible implementation, based on the voiceprint features in the target voice signal, determine the preset language type corresponding to the voiceprint feature as the corresponding source language type target language type.

结合第一方面的第三种可能的实现方式，在第六种可能的实现方式中，获取对识别文本数据进行翻译后得到的目标文本数据，包括：In combination with the third possible implementation of the first aspect, in a sixth possible implementation, the target text data obtained after translating the recognized text data includes:

若判断获知目标语音信号与第一合成语音信号分别传递的信息相互关联，则基于当前轮次之前轮次的语音信号和/或翻译结果，对识别文本数据进行翻译，以得到目标文本数据。If it is determined that the information conveyed by the target speech signal and the first synthesized speech signal are related to each other, the recognized text data is translated based on the speech signal and/or the translation result of the round before the current round to obtain the target text data.

结合第一方面的第三种可能的实现方式，在第七种可能的实现方式中，对目标文本数据进行语音合成，得到第二合成语音信号，包括：In combination with the third possible implementation of the first aspect, in a seventh possible implementation, speech synthesis is performed on the target text data to obtain a second synthesized speech signal, including:

获取语音播报参数，将目标文本数据及语音播报参数输入至语音合成模型，输出第二合成语音信号，语音播报参数中至少包括播报第二合成语音信号时所使用的音色参数。Acquire speech broadcast parameters, input the target text data and speech broadcast parameters into the speech synthesis model, and output the second synthesized speech signal, the speech broadcast parameters include at least the timbre parameters used when broadcasting the second synthesized speech signal.

根据本发明实施例的第二方面，提供了一种语音翻译处理装置，该装置包括：According to a second aspect of the embodiments of the present invention, there is provided a speech translation processing device, the device comprising:

停止播报模块，用于在对第一合成语音信号进行播报的过程中，若接收到包含有部分第一合成语音信号的混合语音信号，则停止播报第一合成语音信号，第一合成语音信号为经由上一轮次翻译以及语音合成后所得到的；Stop the broadcast module, for broadcasting the first synthesized voice signal, if a mixed voice signal containing part of the first synthesized voice signal is received, stop broadcasting the first synthesized voice signal, the first synthesized voice signal is Obtained after the last round of translation and speech synthesis;

过滤模块，用于从混合语音信号过滤掉部分第一合成语音信号，得到当前轮次待翻译的语音信号，并作为目标语音信号；A filter module, configured to filter out part of the first synthesized speech signal from the mixed speech signal, to obtain the speech signal to be translated in the current round, and use it as the target speech signal;

播报模块，用于基于目标语音信号，获取第二合成语音信号，并播报第二合成语音信号，第二合成语音信号是对目标语音信号进行翻译以及语音合成后所得到的。The broadcasting module is used to obtain a second synthesized voice signal based on the target voice signal, and broadcast the second synthesized voice signal. The second synthesized voice signal is obtained by translating and speech-synthesizing the target voice signal.

根据本发明实施例的第三方面，提供了一种语音翻译处理设备，包括：According to a third aspect of the embodiments of the present invention, a speech translation processing device is provided, including:

至少一个处理器；以及at least one processor; and

与处理器通信连接的至少一个存储器，其中：at least one memory communicatively coupled to the processor, wherein:

存储器存储有可被处理器执行的程序指令，处理器调用程序指令能够执行第一方面的各种可能的实现方式中任一种可能的实现方式所提供的语音翻译处理方法。The memory stores program instructions executable by the processor, and the processor invokes the program instructions to execute the speech translation processing method provided in any one of the various possible implementations of the first aspect.

根据本发明的第四方面，提供了一种非暂态计算机可读存储介质，非暂态计算机可读存储介质存储计算机指令，计算机指令使计算机执行第一方面的各种可能的实现方式中任一种可能的实现方式所提供的语音翻译处理方法。According to a fourth aspect of the present invention, a non-transitory computer-readable storage medium is provided, the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause the computer to execute any of the various possible implementations of the first aspect. A speech translation processing method provided by a possible implementation manner.

应当理解的是，以上的一般描述和后文的细节描述是示例性和解释性的，并不能限制本发明实施例。It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory, and are not intended to limit the embodiments of the present invention.

附图说明Description of drawings

图1为本发明实施例的一种语音翻译处理方法的流程示意图；Fig. 1 is a schematic flow chart of a speech translation processing method according to an embodiment of the present invention;

图2为本发明实施例的一种语音翻译处理方法的流程示意图；Fig. 2 is a schematic flow chart of a speech translation processing method according to an embodiment of the present invention;

图3为本发明实施例的一种语音翻译处理方法的流程示意图；Fig. 3 is a schematic flow chart of a speech translation processing method according to an embodiment of the present invention;

图4为本发明实施例的一种语音翻译处理装置的框图；4 is a block diagram of a speech translation processing device according to an embodiment of the present invention;

图5为本发明实施例的一种语音翻译处理设备的框图。Fig. 5 is a block diagram of a speech translation processing device according to an embodiment of the present invention.

具体实施方式Detailed ways

下面结合附图和实施例，对本发明实施例的具体实施方式作进一步详细描述。以下实施例用于说明本发明实施例，但不用来限制本发明实施例的范围。The specific implementation manners of the embodiments of the present invention will be further described in detail below in conjunction with the drawings and embodiments. The following examples are used to illustrate the embodiments of the present invention, but are not intended to limit the scope of the embodiments of the present invention.

目前，不同语种的人在进行沟通交流时，通常是通过自动语音翻译系统实现。其中，自动语音翻译系统通常由语音识别、机器翻译和语音合成三部分组成。源语种语音信号通过语音识别得到源语种文本数据，然后通过机器翻译将源语种文本数据翻译成目标语种文本数据，最后通过对目标语种文本数据进行语音合成，得到目标语种语音信号并进行播报。在进行语音翻译时，需要等到上一轮次的目标语种语音信号播报完后，才可以进行下一轮次的翻译、语音合成及播报。At present, when people of different languages communicate, it is usually realized through an automatic voice translation system. Among them, the automatic speech translation system usually consists of three parts: speech recognition, machine translation and speech synthesis. The source language voice signal is obtained through speech recognition to obtain the source language text data, and then the source language text data is translated into the target language text data through machine translation, and finally the target language text data is synthesized by speech to obtain the target language voice signal and broadcast. When performing speech translation, it is necessary to wait until the last round of speech signals in the target language is broadcasted before the next round of translation, speech synthesis and broadcasting can be performed.

例如，用户A与用户B之间沟通交流，用户A讲英语，用户B讲汉语。用户A讲一句英语，经过翻译以及语音合成的过程得到一句汉语，并进行播报。完成整句播报后，用户A可继续讲一句英语或者由用户B讲一句汉语，并重复上述翻译、语音合成以及播报的过程。也即，用户A与用户B需要等到系统播报结束后，才能接收新的语音数据，并进行翻译、语音合成及播报。For example, user A communicates with user B, user A speaks English, and user B speaks Chinese. User A speaks a sentence of English, gets a sentence of Chinese through the process of translation and speech synthesis, and broadcasts it. After finishing the broadcast of the whole sentence, user A can continue to speak a sentence of English or user B can speak a sentence of Chinese, and repeat the above-mentioned process of translation, speech synthesis and broadcast. That is to say, user A and user B need to wait until the end of the system broadcast to receive new voice data, and perform translation, speech synthesis and broadcast.

考虑到用户讲完一句语音数据后，可能需要对刚讲的语音数据进行补充或修改。另外，在播报合成语音信号时，听播报的用户可能不需要听完就能够明白讲话用户的意图。对于上述情形，若按照上述流程完成上一轮次的整句播报，再进行下一轮次的翻译、语音合成及播报，则会比较耗费时间。针对上述问题，本发明实施例提供了一种语音翻译处理方法。该方法可应用于带有语音采集、翻译、合成及播报功能的终端或系统，可应用于两人或者多人沟通场景，本发明实施例对此不作具体限定。参见图1，该方法包括：101、在对第一合成语音信号进行播报的过程中，若接收到包含有部分第一合成语音信号的混合语音信号，则停止播报第一合成语音信号，第一合成语音信号为经由上一轮次翻译以及语音合成后所得到的；102、从混合语音信号过滤掉部分第一合成语音信号，得到当前轮次待翻译的语音信号，并作为目标语音信号；103、基于目标语音信号，获取第二合成语音信号，并播报第二合成语音信号，第二合成语音信号是对目标语音信号进行翻译以及语音合成后所得到的。Considering that after the user finishes speaking a sentence of voice data, it may be necessary to supplement or modify the voice data just spoken. In addition, when the synthesized voice signal is broadcast, the user who listens to the broadcast may be able to understand the intention of the speaking user without listening to the end. For the above situation, if the previous round of full-sentence broadcast is completed according to the above process, and then the next round of translation, speech synthesis and broadcast is performed, it will be more time-consuming. In view of the above problems, an embodiment of the present invention provides a speech translation processing method. The method can be applied to a terminal or system with functions of voice collection, translation, synthesis, and broadcast, and can be applied to two or more communication scenarios, which is not specifically limited in the embodiments of the present invention. Referring to Fig. 1, the method includes: 101. In the process of broadcasting the first synthesized speech signal, if a mixed speech signal containing part of the first synthesized speech signal is received, stop broadcasting the first synthesized speech signal, and first The synthesized speech signal is obtained after the last round of translation and speech synthesis; 102. Filter out part of the first synthesized speech signal from the mixed speech signal to obtain the speech signal to be translated in the current round and use it as the target speech signal; 103 . Obtain a second synthesized voice signal based on the target voice signal, and broadcast the second synthesized voice signal, where the second synthesized voice signal is obtained by translating and speech-synthesizing the target voice signal.

在上述步骤101中，第一合成语音信号为上一轮次的源语种信号经过采集、翻译及语音合成后所得到。在对第一合成语音信号进行播报的过程中，可同时监听是否有新的源语种语音信号，也即监听是否有用户又讲了需要翻译播报的话。具体地，可通过开启一个监听线程监听，本发明实施例对此不作具体限定。在监听过程中，由于上一轮次的第一合成语音信号还在播报，从而监听到的语音信号除了包含有新的源语种语音信号(即用户新发的言)之外，还存在部分第一合成语音信号，从而接收到的是包含有部分第一合成语音信号的混合语音信号。在接收到包含有部分第一合成语音信号的混合语音信号后，说明此时有用户发了言，该发言可能是上一轮次讲话的用户需要补充或修改，从而打断第一合成语音信号的播报过程进行发言。或者，还可能是听播报的用户没有听完就明白了上一轮次讲话的用户所传达的意图，从而打断第一合成语音信号的播报过程进行发言。In the above step 101, the first synthesized speech signal is obtained after collecting, translating and speech-synthesizing the source language signal of the previous round. In the process of broadcasting the first synthesized voice signal, it is possible to simultaneously monitor whether there is a new voice signal in the source language, that is, monitor whether any user speaks something that needs to be translated and broadcasted. Specifically, monitoring may be performed by starting a monitoring thread, which is not specifically limited in this embodiment of the present invention. During the monitoring process, since the first synthesized voice signal of the previous round is still broadcasting, the monitored voice signal contains not only the new source language voice signal (that is, the user's new speech), but also some first synthesized voice signals. A synthesized speech signal, whereby a mixed speech signal comprising a portion of the first synthesized speech signal is received. After receiving the mixed voice signal containing part of the first synthesized voice signal, it means that a user has made a speech at this time, and the speech may be supplemented or modified by the user who spoke in the last round, thereby interrupting the first synthesized voice signal speak during the broadcast process. Or, it is also possible that the user who listened to the broadcast understands the intention conveyed by the user who spoke in the previous round before finishing listening, and thus interrupts the broadcast process of the first synthesized voice signal to speak.

由于混合语音信号中除了包含有部分第一合成语音信号的混合语音信号之外，还包含当前轮次待翻译的语音信号，而后续需要对当前轮次待翻译的语音信号进行翻译、语音合成及播报，从而在上述步骤102中，需要从混合语音信号中过滤掉部分第一合成语音信号，从而得到当前轮次待翻译的语音信号。本发明实施例不对从混合语音信号中过滤掉部分第一合成语音信号的方式作具体限定，包括但不限于通过回声消除的方式从混合语音信号中过滤掉部分第一合成语音信号。其中，回声消除的计算过程可如下所示：Since the mixed speech signal includes not only the mixed speech signal containing part of the first synthesized speech signal, but also the speech signal to be translated in the current round, and the subsequent speech signal to be translated in the current round needs to be translated, speech synthesized and Broadcasting, so in the above step 102, part of the first synthesized speech signal needs to be filtered out from the mixed speech signal, so as to obtain the speech signal to be translated in the current round. The embodiment of the present invention does not specifically limit the manner of filtering out part of the first synthesized speech signal from the mixed speech signal, including but not limited to filtering out part of the first synthesized speech signal from the mixed speech signal by means of echo cancellation. Among them, the calculation process of echo cancellation can be shown as follows:

以监听设备为麦克风为例，假设播报的部分第一合成语音信号为s(t)，第m个麦克风接收的信道传输函数为h_m(t)，用户新输入的待翻译的语音信号为x_m(t)，则麦克风接收到的观测信号y_m(t)，如下列公式所示：Taking the monitoring device as a microphone as an example, assuming that the broadcast part of the first synthesized speech signal is s(t), the channel transfer function received by the mth microphone is h _m (t), and the speech signal to be translated newly input by the user is x _m (t), then the observed signal y _m (t) received by the microphone is shown in the following formula:

y_m(t)＝s(t)*h_m(t)+x_m(t)y _m (t)=s(t)*h _m (t)+x _m (t)

当没有新输入的待翻译的语音信号x_m(t)时，可提前估计出信道传输函数h_m(t)。当有新输入的待翻译的语音信号x_m(t)时，可对混合语音信号进行回声消除。由于y_m(t)，s(t)，h_m(t)已知，从而可以通过如下公式计算得到当前轮次待翻译的语音信号，并作为目标语音信号，具体公式如下：When there is no new input speech signal x _m (t) to be translated, the channel transfer function h _m (t) can be estimated in advance. When there is a new input speech signal x _m (t) to be translated, echo cancellation can be performed on the mixed speech signal. Since y _m (t), s (t), and h _m (t) are known, the speech signal to be translated in the current round can be calculated by the following formula and used as the target speech signal. The specific formula is as follows:

x′_m(t)＝y(t)-s(t)*h_m(t)x' _m (t)=y(t)-s(t)*h _m (t)

在得到目标语音信号后，可对目标语音信号进行翻译及语音合成，以得到当前轮次的第二合成语音信号，并播报第二合成语音信号。After the target voice signal is obtained, the target voice signal can be translated and speech synthesized to obtain the second synthesized voice signal of the current round, and the second synthesized voice signal is broadcast.

基于上述实施例的内容，沟通过程中的任意一方均可根据需求打断播报过程，从而讲话的用户可以打断自己讲话的播报过程，听播报的用户可以打断上一轮次用户讲话的播报过程。相应地，作为一种可选实施例，第一合成语音信号与目标语音信号为相同的语种类型或者第一合成语音信号与目标语音信号为不同的语种类型。Based on the content of the above-mentioned embodiments, any party in the communication process can interrupt the broadcast process according to the needs, so that the speaking user can interrupt the broadcast process of his own speech, and the user listening to the broadcast can interrupt the broadcast of the last round of user speech process. Correspondingly, as an optional embodiment, the first synthesized speech signal and the target speech signal are of the same language type, or the first synthesized speech signal and the target speech signal are of different language types.

其中，当第一合成语音信号翻译前的语音信号与目标语音信号为相同的语种类型时，则说明可能是上一轮次讲话的用户打断了上一轮次自己讲话的播报过程，或者还能是当前轮次讲话的用户与上一轮次讲话的用户所使用的语种类型相同，且当前轮次讲话的用户打断了上一轮次用户讲话的播报过程。当第一合成语音信号与目标语音信号为不同的语种类型时，则说明是可能是听播报的用户打断了上一轮次用户讲话的播报过程，或者还可能是上一轮次讲话的用户打断了上一轮次自己讲话的播报过程，并使用了与上一轮次不同的语种类型讲话。基于上述内容，本发明实施例可适用于双人或者多人之间不同语种的沟通场景，如双人或多人会议等。Wherein, when the speech signal before the translation of the first synthesized speech signal is of the same language type as the target speech signal, it may be that the user who spoke in the last round interrupted the broadcast process of the last round of his own speech, or It may be that the user speaking in the current round uses the same language type as the user speaking in the previous round, and the user speaking in the current round interrupts the broadcast process of the user speaking in the previous round. When the first synthesized speech signal and the target speech signal are of different language types, it may be that the user listening to the broadcast interrupted the broadcast process of the last round of user speech, or it may also be the user who spoke last round Interrupted the broadcast process of the last round of his own speech, and spoke in a language type different from the previous round. Based on the above content, the embodiments of the present invention are applicable to communication scenarios in different languages between two or more people, such as two or more people conferences.

基于上述实施例的内容，在播报第二合成语音信号之前，需要得到第二合成语音信号。相应地，作为一种可选实施例，本发明实施例提供了一种获取合成语音信号的方法。参见图2，该方法包括：201、获取对目标语音信号进行语音识别后得到的识别文本数据；202、获取对识别文本数据进行翻译后得到的目标文本数据；203、对目标文本数据进行语音合成，得到第二合成语音信号。Based on the content of the foregoing embodiments, before broadcasting the second synthesized voice signal, it is necessary to obtain the second synthesized voice signal. Correspondingly, as an optional embodiment, an embodiment of the present invention provides a method for obtaining a synthesized speech signal. Referring to Fig. 2, the method includes: 201, obtaining the recognition text data obtained after performing speech recognition on the target voice signal; 202, obtaining the target text data obtained after translating the recognition text data; 203, performing speech synthesis on the target text data , to obtain the second synthesized speech signal.

基于上述实施例的内容，作为一种可选实施例，本发明实施例还提供了一种获取目标文本数据的方法。参见图3，该方法包括：2011、确定识别文本数据对应的源语种类型，并按照预设对应关系确定源语种类型对应的目标语种类型；2012、将目标语种类型及识别文本数据输入至翻译编解码模型，输出目标文本数据。Based on the content of the foregoing embodiments, as an optional embodiment, an embodiment of the present invention further provides a method for acquiring target text data. Referring to Fig. 3, the method includes: 2011, determine the source language type corresponding to the recognition text data, and determine the target language type corresponding to the source language type according to the preset correspondence; 2012, input the target language type and the recognition text data into the translation editor Decode the model and output the target text data.

在上述步骤2011中，需先确定识别文本数据对应的源语种类型。本发明实施例不对确定识别文本数据对应的源语种类型的方式作具体限定，包括但不限于如下两种方式。In the above step 2011, it is first necessary to determine the source language type corresponding to the recognized text data. The embodiment of the present invention does not specifically limit the manner of determining the source language type corresponding to the text data, including but not limited to the following two manners.

第一种方式：基于目标语音信号的声学特征确定。The first way: based on the acoustic feature determination of the target speech signal.

具体地，可提取目标语音信号的声学特征，如频谱特征：梅尔倒谱系数(Mel-Frequency Cepstral Coefficients，MFCC)、感知线性预测系数(Linear PredictiveCoding，PLP)等，将声学特征输入至语种识别模型，对目标语音信号进行语种预测。语种识别模型的输出结果即为目标语音信号为每个语种类型的概率，选择概率最大的语种作为目标语音信号对应的语种，也即确定了识别文本数据对应的源语种类型。其中，语种识别模型一般为模式识别中常用分类模型，具体可以通过预先收集大量语音信号，提取每条语音信号的声学特征，标注每条语音信号的语种类型构建得到。Specifically, the acoustic features of the target speech signal can be extracted, such as spectral features: Mel-Frequency Cepstral Coefficients (MFCC), perceptual linear predictive coefficients (Linear Predictive Coding, PLP), etc., and the acoustic features can be input to language recognition A model to predict the language of the target speech signal. The output result of the language recognition model is the probability that the target speech signal is each language type, and the language with the highest probability is selected as the language corresponding to the target speech signal, that is, the source language type corresponding to the recognized text data is determined. Among them, the language recognition model is generally a classification model commonly used in pattern recognition. Specifically, it can be constructed by collecting a large number of speech signals in advance, extracting the acoustic features of each speech signal, and marking the language type of each speech signal.

第二种方式，基于目标语音信号的识别结果确定。The second way is to determine based on the recognition result of the target speech signal.

具体地，分别利用当前涉及的每个语种对应的语音识别模型对目标语音信号进行语音识别，得到目标语音信号对应每个语种的识别文本数据及相应的识别置信度，选择识别置信度最大的识别文本数据对应语种作为目标语音信号的语种。其中，语音识别过程一般为：先对目标语音信号进行端点检测，得到有效语音段的起始点和结束点。然后对端点检测得到的有效语音段进行特征提取，再利用提取的特征数据及预先训练的声学模型和语言模型进行解码，得到当前语音数据对应识别文本及相应识别文本的置信度。Specifically, use the speech recognition model corresponding to each language currently involved to perform speech recognition on the target speech signal, obtain the recognition text data corresponding to each language of the target speech signal and the corresponding recognition confidence, and select the recognition model with the highest recognition confidence. The language corresponding to the text data is used as the language of the target speech signal. Wherein, the voice recognition process generally includes: first, endpoint detection is performed on the target voice signal to obtain the starting point and the ending point of the effective voice segment. Then, feature extraction is performed on the effective speech segment obtained by endpoint detection, and then the extracted feature data and the pre-trained acoustic model and language model are used for decoding to obtain the corresponding recognition text of the current speech data and the confidence of the corresponding recognition text.

例如，假设目标语音信号对应语种为中文；当前涉及的语种为中文和英文。对目标语音信号进行语种识别时，分别利用中文语音识别模型和英文语音识别模型对目标语音信号进行语音识别，得到目标语音信号对应的中文识别文本数据及相应识别置信度0.9，英文识别文本数据和相应识别置信度0.2。选择识别置信度较大的识别文本数据对应语种，即中文作为目标语音信号对应语种。进一步地，还可以将每个语种对应识别文本数据的识别置信度及语言模型得分进行融合，选择融合得分最大的识别文本数据所对应的语种作为目标语音信号对应的语种。其中，融合方法可以为线性加权方法，本发明实施例对此不作具体限定。For example, assume that the language corresponding to the target voice signal is Chinese; the currently involved languages are Chinese and English. When performing language recognition on the target speech signal, use the Chinese speech recognition model and the English speech recognition model to perform speech recognition on the target speech signal, and obtain the Chinese recognition text data corresponding to the target speech signal and the corresponding recognition confidence of 0.9, and the English recognition text data and The corresponding recognition confidence is 0.2. Select the corresponding language of the recognized text data with higher recognition confidence, that is, Chinese as the corresponding language of the target voice signal. Further, the recognition confidence and language model score of the recognized text data corresponding to each language can be fused, and the language corresponding to the recognized text data with the highest fusion score can be selected as the language corresponding to the target speech signal. Wherein, the fusion method may be a linear weighting method, which is not specifically limited in this embodiment of the present invention.

在确定识别文本数据对应的源语种类型后，可按照预设对应关系确定源语种类型对应的目标语种类型。其中，源语种类型即为讲话用户所使用的语种，目标语种类型一般为当前用户交流涉及的语种中除源语种类型之外的其它语种类型。其中，目标语种类型可以为一种或多种，分别对应使用两个语种的用户和使用多个语种的用户进行沟通交流。After determining the source language type corresponding to the recognized text data, the target language type corresponding to the source language type can be determined according to the preset correspondence relationship. Wherein, the source language type is the language used by the speaking user, and the target language type generally refers to other language types except the source language type among the languages involved in the current user communication. Wherein, the type of the target language may be one or more types, which respectively correspond to communication between users who use two languages and users who use multiple languages.

例如，若当前用户交流时涉及的语种为中文、英语、法语、泰语、印地语、德语等，则可以预先确定中文语音信号每次翻译成英语、法语两种语言，也就是将英语和法语作为目标语种类型。For example, if the languages involved in the current user communication are Chinese, English, French, Thai, Hindi, German, etc., it can be pre-determined that the Chinese voice signal is translated into English and French each time, that is, English and French as the target language type.

除了预先设置好固定的目标语种类型之外，还可以基于讲话用户的偏好或者需求来确定源语种类型对应的目标语种类型。相应地，作为一种可选实施例，本发明实施例不对按照预设对应关系确定源语种类型对应的目标语种类型的方式作具体限定，包括但不限于：基于目标语音信号中的声纹特征，确定声纹特征对应的预设语种类型，以作为源语种类型对应的目标语种类型。In addition to setting a fixed target language type in advance, the target language type corresponding to the source language type can also be determined based on the preference or demand of the speaking user. Correspondingly, as an optional embodiment, the embodiment of the present invention does not specifically limit the method of determining the target language type corresponding to the source language type according to the preset correspondence relationship, including but not limited to: based on the voiceprint features in the target voice signal , determining the preset language type corresponding to the voiceprint feature as the target language type corresponding to the source language type.

例如，若开会中领导讲话需要自动翻译成英语，则可预先将领导讲话时的目标语种类型确定为英语，并与领导讲话时的声纹特征之间建立对应关系，从而在提取到开会中领导讲话的声纹特征后，可直接确定目标语种类型为英语。For example, if the leader's speech in a meeting needs to be automatically translated into English, the target language type of the leader's speech can be determined in advance as English, and a corresponding relationship with the voiceprint features of the leader's speech can be established, so that the leader can be extracted in the meeting. After the voiceprint features of the speech, it can be directly determined that the target language type is English.

在上述步骤2012中，在将目标语种类型及识别文本数据输入至翻译编解码模型，输出目标文本数据时，可以采用基于神经网络的翻译编解码模型，将识别文本数据翻译成相应的目标语种文本数据，本发明实施例对此不作具体限定。其中，每个语种类型对应的翻译编解码模型可以预先使用大量训练数据构建得到。In the above step 2012, when inputting the target language type and the recognized text data into the translation codec model and outputting the target text data, a neural network-based translation codec model can be used to translate the recognized text data into the corresponding target language text data, which is not specifically limited in this embodiment of the present invention. Among them, the translation codec model corresponding to each language type can be constructed in advance using a large amount of training data.

除了上述通过翻译编解码模型对识别文本数据进行翻译之外，考虑到讲话时上下文内容之间的关联性，本发明实施例还提供了一种获取目标文本数据的方法，包括但不限于：若判断获知目标语音信号与第一合成语音信号分别传递的信息相互关联，则基于当前轮次之前轮次的语音信号和/或翻译结果，对识别文本数据进行翻译，以得到目标文本数据。In addition to translating the recognition text data by translating the codec model, the embodiment of the present invention also provides a method for obtaining the target text data in consideration of the correlation between the context and content of the speech, including but not limited to: if If it is determined that the information conveyed by the target speech signal and the first synthesized speech signal are related to each other, the recognized text data is translated based on the speech signals and/or translation results of the round before the current round to obtain the target text data.

具体地，可判断当前轮次的目标语音信号与上一轮次的第一合成语音信号分别传递的信息是否相互关联，若两者相关(如在播报过程中，讲话的用户打断播报需要补充)，则在翻译时可以将当前轮次的目标语音信号与上一轮次已翻译过的语音信号进行融合后再翻译，最后再将翻译结果播报出来；也可以在翻译后，将两轮语音信号的翻译结果进行融合后，再播报出来；还可以结合上一轮次已翻译过的语音信号、上一轮次的翻译结果、目标语音信号以及目标语音信号的翻译结果，将两轮翻译结果进行融合后再播报出来；另外，上述对当前轮次的识别文本数据进行翻译的过程中，除了将上一轮次的语音信号和/或翻译结果作为参考之外，还可以将前n轮次的语音信号和/或翻译结果作为参考，本发明实施例对此不作具体限定。其中，n大于等于1。Specifically, it can be determined whether the information conveyed by the target voice signal of the current round and the first synthesized voice signal of the previous round are related to each other, if the two are related (such as during the broadcast process, the speaking user interrupts the broadcast and needs to supplement ), then during translation, the target speech signal of the current round can be fused with the speech signal that has been translated in the previous round before translation, and finally the translation result will be broadcast; or after translation, the two rounds of speech After the translation results of the signals are fused, they are broadcasted; it is also possible to combine the translated speech signals of the previous round, the translation results of the previous round, the target speech signal, and the translation results of the target speech signal to combine the translation results of the two rounds In addition, in the above-mentioned process of translating the recognized text data of the current round, in addition to using the voice signal and/or translation results of the previous round as a reference, the previous n rounds can also be used as a reference. The speech signal and/or the translation result are used as a reference, which is not specifically limited in this embodiment of the present invention. Wherein, n is greater than or equal to 1.

在获取到目标文本数据后，可对目标文本数据进行语音合成，得到第二合成语音信号。相应地，作为一种可选实施例，关于对目标文本数据进行语音合成，得到第二合成语音信号的方式，本发明实施例对此不作具体限定，包括但不限于：获取语音播报参数，将目标文本数据及语音播报参数输入至语音合成模型，输出第二合成语音信号，语音播报参数中至少包括播报第二合成语音信号时所使用的音色参数。After the target text data is acquired, speech synthesis may be performed on the target text data to obtain a second synthesized speech signal. Correspondingly, as an optional embodiment, the embodiment of the present invention does not specifically limit the method of performing speech synthesis on the target text data to obtain the second synthesized speech signal, including but not limited to: obtaining speech broadcast parameters, The target text data and speech broadcast parameters are input to the speech synthesis model, and the second synthesized speech signal is output, and the speech broadcast parameters include at least the timbre parameters used when broadcasting the second synthesized speech signal.

其中，在进行语音合成时，可以选择一个固定发音人模型进行合成，如可以使用一个声音中性、浑厚的声音的合成模型合成相应的第二合成语音信号。当然，还可以选择个性化的发音人模型进行合成。具体地，语音翻译系统中可包含多种不同音色的声音，用户可以自己选择，也可以由系统根据当前用户的用户信息来进行选择，本发明实施例对此不作具体限定。其中，用户信息包括但不限于用户的性别、年龄、音色等。例如，若听播报的用户为男性，系统可自动选择女性发音人模型，以合成女性发声的第二合成语音信号。Wherein, when performing speech synthesis, a fixed speaker model can be selected for synthesis, for example, a synthesis model with a neutral and thick voice can be used to synthesize the corresponding second synthesized speech signal. Of course, you can also choose a personalized speaker model for synthesis. Specifically, the voice translation system may include a variety of voices with different timbres, which can be selected by the user or selected by the system based on the user information of the current user, which is not specifically limited in this embodiment of the present invention. Wherein, the user information includes but not limited to the user's gender, age, timbre and so on. For example, if the user listening to the broadcast is a male, the system can automatically select a female speaker model to synthesize a second synthesized voice signal uttered by a female.

当然，还可以利用声音转换，将合成的声音转换成与用户音色相近的声音进行播报。例如，在对用户A输入的待翻译的语音信号进行翻译后，利用声音转换，使得转换后的语音信号的音色与用户A音色相近，将转换后的语音信号播报出来。Of course, voice conversion can also be used to convert the synthesized voice into a voice similar to the user's timbre for broadcasting. For example, after translating the speech signal input by user A to be translated, voice conversion is used to make the timbre of the converted speech signal similar to that of user A, and the converted speech signal is broadcast.

另外，由于基于当前轮次之前轮次的语音信号和/或翻译结果，对识别文本数据进行翻译，从而可充分利用讲话时上下文的关联性，提高翻译准确率。In addition, since the recognized text data is translated based on the speech signals and/or translation results of the rounds before the current round, the relevance of the context of speech can be fully utilized to improve translation accuracy.

最后，由于可按照预设对应关系确定源语种类型对应的目标语种类型，如可基于目标语音信号中的声纹特征，确定目标语种类型，从而可满足用户在翻译时的偏好和需求，实现个性化定制。Finally, since the target language type corresponding to the source language type can be determined according to the preset corresponding relationship, for example, the target language type can be determined based on the voiceprint characteristics in the target voice signal, so as to meet the user's preferences and needs when translating, and realize personalized translation. customization.

需要说明的是，上述所有可选实施例，可以采用任意结合形成本发明的可选实施例，在此不再一一赘述。It should be noted that all the above optional embodiments may be combined in any way to form optional embodiments of the present invention, which will not be repeated here.

基于上述实施例的内容，本发明实施例提供了一种语音翻译处理装置，该语音翻译处理装置用于执行上述方法实施例中的语音翻译处理方法。参见图4，该装置包括：Based on the content of the foregoing embodiments, an embodiment of the present invention provides a speech translation processing device, the speech translation processing device is configured to execute the speech translation processing method in the foregoing method embodiments. Referring to Figure 4, the device includes:

停止播报模块401，用于在对第一合成语音信号进行播报的过程中，若接收到包含有部分第一合成语音信号的混合语音信号，则停止播报第一合成语音信号，第一合成语音信号为经由上一轮次翻译以及语音合成后所得到的；Stop broadcasting module 401, for broadcasting the first synthesized voice signal, if a mixed voice signal containing part of the first synthesized voice signal is received, stop broadcasting the first synthesized voice signal, the first synthesized voice signal It is obtained after the last round of translation and speech synthesis;

过滤模块402，用于从混合语音信号过滤掉部分第一合成语音信号，得到当前轮次待翻译的语音信号，并作为目标语音信号；The filtering module 402 is used to filter out part of the first synthesized speech signal from the mixed speech signal, obtain the speech signal to be translated in the current round, and use it as the target speech signal;

播报模块403，用于基于目标语音信号，获取第二合成语音信号，并播报第二合成语音信号，第二合成语音信号是对目标语音信号进行翻译以及语音合成后所得到的。The broadcasting module 403 is configured to acquire a second synthesized voice signal based on the target voice signal, and broadcast the second synthesized voice signal. The second synthesized voice signal is obtained by translating and speech-synthesizing the target voice signal.

作为一种可选实施例，第一合成语音信号与目标语音信号为相同的语种类型或者第一合成语音信号与目标语音信号为不同的语种类型。As an optional embodiment, the first synthesized speech signal and the target speech signal are of the same language type, or the first synthesized speech signal and the target speech signal are of different language types.

作为一种可选实施例，该装置还包括：As an optional embodiment, the device also includes:

第一获取模块，用于获取对目标语音信号进行语音识别后得到的识别文本数据；The first obtaining module is used to obtain the recognized text data obtained after performing speech recognition on the target speech signal;

第二获取模块，用于获取对识别文本数据进行翻译后得到的目标文本数据；The second acquiring module is used to acquire target text data obtained after translating the recognized text data;

语音合成模块，用于对目标文本数据进行语音合成，得到第二合成语音信号。The speech synthesis module is configured to perform speech synthesis on the target text data to obtain a second synthesized speech signal.

作为一种可选实施例，第二获取模块，包括：As an optional embodiment, the second acquisition module includes:

确定单元，用于确定识别文本数据对应的源语种类型，并按照预设对应关系确定源语种类型对应的目标语种类型；A determining unit, configured to determine the source language type corresponding to the recognized text data, and determine the target language type corresponding to the source language type according to the preset correspondence relationship;

翻译单元，用于将目标语种类型及识别文本数据输入至翻译编解码模型，输出目标文本数据。The translation unit is used to input the target language type and recognized text data into the translation codec model, and output the target text data.

作为一种可选实施例，确定单元，用于基于目标语音信号中的声纹特征，确定声纹特征对应的预设语种类型，以作为源语种类型对应的目标语种类型。As an optional embodiment, the determining unit is configured to determine, based on the voiceprint features in the target voice signal, a preset language type corresponding to the voiceprint features as a target language type corresponding to the source language type.

作为一种可选实施例，第二获取模块，用于当判断获知目标语音信号与第一合成语音信号分别传递的信息相互关联时，则基于当前轮次之前轮次的语音信号和/或翻译结果，对识别文本数据进行翻译，以得到目标文本数据。As an optional embodiment, the second acquisition module is configured to, when it is determined that the information conveyed by the target speech signal and the first synthesized speech signal are related to each other, based on the speech signal and/or translation of the previous round of the current round As a result, the recognized text data is translated to obtain target text data.

作为一种可选实施例，语音合成模块，用于获取语音播报参数，将目标文本数据及语音播报参数输入至语音合成模型，输出第二合成语音信号，语音播报参数中至少包括播报第二合成语音信号时所使用的音色参数。As an optional embodiment, the speech synthesis module is used to obtain speech broadcast parameters, input target text data and speech broadcast parameters to the speech synthesis model, and output the second synthesized speech signal, and the speech broadcast parameters include at least the second synthesized speech signal. Timbre parameters used for voice signals.

本发明实施例提供的装置，通过在对第一合成语音信号进行播报的过程中，若接收到包含有部分第一合成语音信号的混合语音信号，则停止播报第一合成语音信号。从混合语音信号过滤掉部分第一合成语音信号，得到当前轮次待翻译的语音信号，并作为目标语音信号。基于目标语音信号，获取第二合成语音信号，并播报第二合成语音信号。由于沟通过程中的任意一方，均可按照全双工模式随时打断播报过程，而不用每次都等到一轮播报过程结束，从而在提高沟通效率的同时，还可使得不同语种用户之间沟通更加自然流畅。The device provided by the embodiment of the present invention stops broadcasting the first synthesized voice signal if a mixed voice signal including part of the first synthesized voice signal is received during broadcasting of the first synthesized voice signal. Part of the first synthesized speech signal is filtered out from the mixed speech signal to obtain the speech signal to be translated in the current round as the target speech signal. Obtain a second synthesized voice signal based on the target voice signal, and broadcast the second synthesized voice signal. Since any party in the communication process can interrupt the broadcast process at any time according to the full-duplex mode, instead of waiting for a round of broadcast process to end every time, it can improve communication efficiency and enable users of different languages to communicate More natural flow.

本发明实施例提供了一种语音翻译处理设备。参见图5，该设备包括：处理器(processor)501、存储器(memory)502和总线503；An embodiment of the present invention provides a speech translation processing device. Referring to FIG. 5, the device includes: a processor (processor) 501, a memory (memory) 502 and a bus 503;

其中，处理器501及存储器502分别通过总线503完成相互间的通信；Wherein, the processor 501 and the memory 502 complete the mutual communication through the bus 503 respectively;

处理器501用于调用存储器502中的程序指令，以执行上述实施例所提供的语音翻译处理方法，例如包括：在对第一合成语音信号进行播报的过程中，若接收到包含有部分第一合成语音信号的混合语音信号，则停止播报第一合成语音信号，第一合成语音信号为经由上一轮次翻译以及语音合成后所得到的；从混合语音信号过滤掉部分第一合成语音信号，得到当前轮次待翻译的语音信号，并作为目标语音信号；基于目标语音信号，获取第二合成语音信号，并播报第二合成语音信号，第二合成语音信号是对目标语音信号进行翻译以及语音合成后所得到的。The processor 501 is used to call the program instructions in the memory 502 to execute the speech translation processing method provided by the above-mentioned embodiments, for example, including: in the process of broadcasting the first synthesized speech signal, if a part of the first The mixed voice signal of the synthesized voice signal, then stop broadcasting the first synthesized voice signal, the first synthesized voice signal is obtained after the last round of translation and speech synthesis; filter out part of the first synthesized voice signal from the mixed voice signal, Obtain the speech signal to be translated in the current round, and use it as the target speech signal; based on the target speech signal, obtain the second synthesized speech signal, and broadcast the second synthesized speech signal, the second synthesized speech signal is to translate the target speech signal and voice obtained after synthesis.

本发明实施例提供一种非暂态计算机可读存储介质，该非暂态计算机可读存储介质存储计算机指令，该计算机指令使计算机执行上述实施例所提供的语音翻译处理方法，例如包括：在对第一合成语音信号进行播报的过程中，若接收到包含有部分第一合成语音信号的混合语音信号，则停止播报第一合成语音信号，第一合成语音信号为经由上一轮次翻译以及语音合成后所得到的；从混合语音信号过滤掉部分第一合成语音信号，得到当前轮次待翻译的语音信号，并作为目标语音信号；基于目标语音信号，获取第二合成语音信号，并播报第二合成语音信号，第二合成语音信号是对目标语音信号进行翻译以及语音合成后所得到的。An embodiment of the present invention provides a non-transitory computer-readable storage medium, the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause the computer to execute the speech translation processing method provided in the above-mentioned embodiments, for example, including: In the process of broadcasting the first synthesized voice signal, if a mixed voice signal containing part of the first synthesized voice signal is received, then stop broadcasting the first synthesized voice signal, the first synthesized voice signal is translated and Obtained after speech synthesis; filter out part of the first synthesized speech signal from the mixed speech signal, obtain the speech signal to be translated in the current round, and use it as the target speech signal; obtain the second synthesized speech signal based on the target speech signal, and broadcast The second synthesized speech signal, which is obtained by translating and speech-synthesizing the target speech signal.

本领域普通技术人员可以理解：实现上述方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成，前述的程序可以存储于一计算机可读取存储介质中，该程序在执行时，执行包括上述方法实施例的步骤；而前述的存储介质包括：ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。Those of ordinary skill in the art can understand that all or part of the steps for realizing the above-mentioned method embodiments can be completed by hardware related to program instructions, and the aforementioned program can be stored in a computer-readable storage medium. When the program is executed, the It includes the steps of the above method embodiments; and the aforementioned storage medium includes: ROM, RAM, magnetic disk or optical disk and other various media that can store program codes.

以上所描述的信息交互设备等实施例仅仅是示意性的，其中作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下，即可以理解并实施。The above-described embodiments such as information interaction equipment are only illustrative, and the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may Located in one place, or can be distributed to multiple network elements. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. It can be understood and implemented by those skilled in the art without any creative effort.

通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件。基于这样的理解，上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品可以存储在计算机可读存储介质中，如ROM/RAM、磁碟、光盘等，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行各个实施例或者实施例的某些部分方法。Through the above description of the implementations, those skilled in the art can clearly understand that each implementation can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware. Based on this understanding, the essence of the above technical solution or the part that contributes to the prior art can be embodied in the form of software products, and the computer software products can be stored in computer-readable storage media, such as ROM/RAM, magnetic Discs, optical discs, etc., include several instructions to make a computer device (which may be a personal computer, server, or network device, etc.) execute various embodiments or some partial methods of the embodiments.

最后，本申请的方法仅为较佳的实施方案，并非用于限定本发明实施例的保护范围。凡在本发明实施例的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本发明实施例的保护范围之内。Finally, the method of the present application is only a preferred implementation, and is not intended to limit the scope of protection of the examples of the present invention. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principle of the embodiments of the present invention shall be included in the protection scope of the embodiments of the present invention.

Claims

1. A voice translation processing method, characterized in that, comprising:

In the process of broadcasting the first synthesized voice signal, if a mixed voice signal containing part of the first synthesized voice signal is received, the broadcast of the first synthesized voice signal is stopped, and the first synthesized voice signal is Obtained after the last round of translation and speech synthesis;

Filtering out part of the first synthesized speech signal from the mixed speech signal to obtain the speech signal to be translated in the current round as the target speech signal;

Obtaining a second synthesized voice signal based on the target voice signal, and broadcasting the second synthesized voice signal, where the second synthesized voice signal is obtained by translating and speech-synthesizing the target voice signal.

2. The method according to claim 1, wherein the first synthesized speech signal and the target speech signal are of the same language type or the first synthesized speech signal is different from the target speech signal Language type.

3. The method according to claim 1, wherein, before the broadcasting of the second synthesized speech signal, further comprising:

Acquiring recognized text data obtained after performing speech recognition on the target speech signal, obtaining target text data obtained after translating the recognized text data, and performing speech synthesis on the target text data to obtain the second synthesized voice signal.

4. The method according to claim 3, wherein said acquiring the target text data obtained after translating said recognition text data comprises:

Determine the source language type corresponding to the identified text data, and determine the target language type corresponding to the source language type according to the preset correspondence relationship;

Input the target language type and the recognized text data into the translation codec model, and output the target text data.

5. The method according to claim 4, wherein said determining the target language type corresponding to the source language type according to the preset corresponding relationship comprises:

Based on the voiceprint feature in the target voice signal, determine a preset language type corresponding to the voiceprint feature as the target language type corresponding to the source language type.

6. The method according to claim 3, wherein said acquiring the target text data obtained after translating said recognition text data comprises:

If it is determined that the target speech signal and the information conveyed by the first synthesized speech signal are related to each other, then based on the speech signal and/or the translation result of the round before the current round, the recognized text data is translated to obtain The target text data is obtained.

7. The method according to claim 3, wherein said performing speech synthesis on said target text data to obtain said second synthesized speech signal comprises:

Acquiring speech broadcast parameters, inputting the target text data and the speech broadcast parameters into the speech synthesis model, and outputting the second synthesized speech signal, the speech broadcast parameters at least include the The tone parameters to use.

8. A speech translation processing device, characterized in that, comprising:

stop broadcasting module, for broadcasting the first synthesized voice signal, if a mixed voice signal containing part of the first synthesized voice signal is received, then stop broadcasting the first synthesized voice signal, the The first synthesized speech signal is obtained after a previous round of translation and speech synthesis;

A filtering module, configured to filter out part of the first synthesized speech signal from the mixed speech signal, to obtain the speech signal to be translated in the current round, and use it as the target speech signal;

The broadcast module is used to obtain a second synthesized voice signal based on the target voice signal, and broadcast the second synthesized voice signal, the second synthesized voice signal is obtained after the target voice signal is translated and speech synthesized owned.

9. A speech translation processing device, characterized in that, comprising:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein:

The memory stores program instructions executable by the processor, and the processor can execute the method according to any one of claims 1 to 7 by calling the program instructions.

10. A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause the computer to execute the computer according to any one of claims 1 to 7. Methods.