
CN115312078A - Method, apparatus, device and storage medium for determining quality of voice data - Google Patents


Info

Publication number
CN115312078A
CN115312078A
Authority
CN
China
Prior art keywords
quality level
speech
quality
decoder
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210939917.2A
Other languages
Chinese (zh)
Other versions
CN115312078B (en)
Inventor
田霄海
付凯奇
高绍钧
顾怡炜
王凯
李伟
马泽君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Lemon Inc Cayman Island
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Lemon Inc Cayman Island
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd, Lemon Inc Cayman Island filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202210939917.2A priority Critical patent/CN115312078B/en
Publication of CN115312078A publication Critical patent/CN115312078A/en
Application granted granted Critical
Publication of CN115312078B publication Critical patent/CN115312078B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present disclosure relate to a method, apparatus, device, and storage medium for determining the quality of speech data. The method includes determining features of acquired speech data. The method further includes obtaining, based on the features, a first quality level and a second quality level for the speech data, where the first quality level is related to a first language and the second quality level is related to a second language. The method further includes determining, based on the first quality level and the second quality level, a target quality level for the speech data, the target quality level indicating the speech quality of the speech data. The method improves the accuracy and efficiency of evaluating the quality of speech data across domains, increases data utilization, and improves the user experience.

Description

Method, apparatus, device and storage medium for determining the quality of speech data

Technical Field

Embodiments of the present disclosure relate generally to the field of speech data processing, and in particular to a method, apparatus, device, and storage medium for determining the quality of speech data.

Background Art

With the development of computer technology, the level of speech processing is improving rapidly. Synthesizing speech data or converting speech data with computing devices is increasingly used in a variety of devices and applications. Such speech data can be evaluated in terms of speech quality, which is the primary indicator of the performance of systems such as speech synthesis and voice conversion. The Mean Opinion Score (MOS) is a subjective score of an audio clip's speech quality, given by annotators after a listening test on the synthesized audio. Because traditional MOS scoring requires the participation of a large number of annotators, this subjective evaluation process is expensive and time-consuming. Many problems therefore remain to be solved in the processing of speech data.

Summary of the Invention

Embodiments of the present disclosure provide a method, apparatus, device, and storage medium for determining the quality of speech data.

According to a first aspect of the present disclosure, a method of determining the quality of speech data is provided. The method includes determining features of acquired speech data. The method further includes obtaining, based on the features, a first quality level and a second quality level for the speech data, where the first quality level is related to a first language and the second quality level is related to a second language. The method further includes determining, based on the first quality level and the second quality level, a target quality level for the speech data, the target quality level indicating the speech quality of the speech data.

In a second aspect of the present disclosure, an apparatus for determining the quality of speech data is provided. The apparatus includes a feature determination module configured to determine features of acquired speech data; a quality level acquisition module configured to obtain, based on the features, a first quality level and a second quality level for the speech data, where the first quality level is related to a first language and the second quality level is related to a second language; and a target quality level determination module configured to determine, based on the first quality level and the second quality level, a target quality level for the speech data, the target quality level indicating the speech quality of the speech data.

In a third aspect of the present disclosure, an electronic device is provided, including at least one processor and a storage device storing at least one program which, when executed by the at least one processor, causes the at least one processor to implement the method according to the first aspect of the present disclosure.

In a fourth aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the program implements the method according to the first aspect of the present disclosure.

It should be understood that the content described in this summary is not intended to identify key or essential features of the embodiments of the present disclosure, nor to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.

Brief Description of the Drawings

The above and other objects, features, and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the present disclosure taken in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.

FIG. 1 illustrates a schematic diagram of an example environment 100 in which devices and/or methods of embodiments of the present disclosure may be implemented;

FIG. 2 illustrates a flowchart of a process 200 for determining the quality of speech data according to an embodiment of the present disclosure;

FIG. 3 illustrates a schematic diagram of an example 300 of generating a score for speech data according to an embodiment of the present disclosure;

FIG. 4 illustrates a schematic diagram of a process 400 for training a decoder and a speech representation model according to an embodiment of the present disclosure;

FIG. 5 illustrates a schematic diagram of an example 500 of training a decoder and a speech representation model according to an embodiment of the present disclosure;

FIG. 6 illustrates a schematic block diagram of an apparatus 600 for determining the quality of speech data according to an embodiment of the present disclosure;

FIG. 7 illustrates a schematic block diagram of an example device 700 suitable for implementing embodiments of the present disclosure.

In the drawings, the same or corresponding reference numerals denote the same or corresponding parts.

Detailed Description

It should be understood that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with relevant laws and regulations, of the type, scope of use, and usage scenarios of the personal information involved in the present disclosure, and the user's authorization should be obtained.

For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly remind the user that the requested operation will require acquiring and using the user's personal information. The user can thus independently choose, according to the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, application, server, or storage medium, that performs the operations of the technical solution of the present disclosure.

As an optional but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user in the form of, for example, a pop-up window, in which the prompt information may be presented as text. In addition, the pop-up window may carry a selection control allowing the user to choose "agree" or "disagree" to providing personal information to the electronic device.

It should be understood that the above process of notifying the user and obtaining the user's authorization is merely illustrative and does not limit the implementations of the present disclosure; other methods that comply with relevant laws and regulations may also be applied to implementations of the present disclosure.

It should be understood that the data involved in this technical solution (including but not limited to the data itself and the acquisition or use of the data) shall comply with the requirements of applicable laws, regulations, and relevant provisions.

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the scope of protection of the present disclosure.

In the description of the embodiments of the present disclosure, the term "comprising" and similar terms should be understood as open-ended inclusion, that is, "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first", "second", and so on may refer to different or identical objects. Other explicit and implicit definitions may also be included below.

As mentioned above, having annotators score the speech quality of speech data is expensive and time-consuming. Automatic MOS scoring systems have therefore been proposed, which use machines to score synthesized audio in place of annotators' subjective evaluation, thereby saving time and resources. However, machine MOS scoring faces at least two challenges. The first is data sparsity: not much data is available for training a MOS scoring system, which limits its performance. The second concerns scoring cross-domain synthesized audio (synthesized audio in different languages): for synthesized audio in various languages, a MOS scoring system may be unable to give an accurate score due to the lack of training data in the corresponding language.

Some traditional schemes design an automatic scoring system that uses a pre-trained encoder for automatic MOS scoring. Such a scheme can achieve good scoring results on data from the same domain (where the training set and test set share the same language).

However, this traditional scheme has many problems in cross-domain tasks. For example, when training sets from different domains are used separately, cross-domain training data is very scarce, which limits the encoder's performance; when training sets from different domains are mixed, the system cannot adapt well to the synthesized audio of the test set because some training sets and the test set have different language backgrounds. This leads to inaccurate scoring by the MOS system.

In addition, these traditional schemes do not consider the connections between different tasks, such as the relationship between automatic speech recognition and automatic MOS scoring. MOS scoring data is sparse, but automatic speech recognition data is abundant.

To address at least the above and other potential problems, embodiments of the present disclosure propose a method of determining the quality of speech data. In this method, a computing device determines features of acquired speech data. Using the obtained features, it then obtains a first quality level and a second quality level for the speech data; these quality levels are related to the kind of language. The computing device then determines a target quality level for the speech data based on the first quality level and the second quality level. This method improves the accuracy and efficiency of evaluating the quality of speech data across domains, increases data utilization, and improves the user experience.
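As a concrete illustration, the three steps described above (feature determination, per-language quality levels, target-level selection) can be sketched as follows. The waveform statistics, scorer weights, and language tags are illustrative stand-ins, not the disclosure's trained models:

```python
# Minimal sketch of the disclosed pipeline: features -> per-language
# quality levels -> target quality level. All numbers and helper
# functions below are illustrative stand-ins, not trained models.

def extract_features(speech_samples):
    # Stand-in for a speech representation model (e.g. wav2vec2):
    # here we simply summarize the waveform with two statistics.
    n = len(speech_samples)
    mean = sum(speech_samples) / n
    energy = sum(x * x for x in speech_samples) / n
    return [mean, energy]

def score_language(features, weights, bias):
    # Stand-in for a per-language decoder producing a MOS-like
    # score, clamped to the conventional 1..5 range.
    raw = bias + sum(w * f for w, f in zip(weights, features))
    return max(1.0, min(5.0, raw))

def target_quality_level(speech_samples, language):
    features = extract_features(speech_samples)
    levels = {
        "en": score_language(features, weights=[0.5, 2.0], bias=3.0),
        "zh": score_language(features, weights=[0.2, 1.5], bias=3.2),
    }
    # The target level is the one matching the utterance's language.
    return levels[language]

audio = [0.1, -0.2, 0.3, 0.0, 0.05, -0.1]  # toy waveform
print(round(target_quality_level(audio, "en"), 3))
```

The dictionary of levels mirrors the idea that all language-specific scores are computed from one shared feature vector, with the language of the utterance choosing among them.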

Embodiments of the present disclosure will be described in further detail below with reference to the accompanying drawings, where FIG. 1 shows an example environment 100 in which devices and/or methods of embodiments of the present disclosure may be implemented.

The environment 100 includes a computing device 104 for processing speech data 102 to determine a target quality level 112 for the speech data 102.

Examples of the computing device 104 include, but are not limited to, personal computers, server computers, handheld or laptop devices, mobile devices (such as mobile phones, personal digital assistants (PDAs), and media players), multiprocessor systems, consumer electronics, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.

The speech data 102 received by the computing device 104 is speech data containing a speaker's voice. It includes, but is not limited to, synthesized speech data or converted speech data. Synthesized speech data includes speech synthesized by a computing device to simulate human speech. Converted speech data includes speech obtained by converting one speaker's voice into another speaker's voice, or by converting what a speaker says into another language. FIG. 1 shows the computing device 104 receiving the speech data 102; this is merely an example and not a limitation of the present disclosure. The computing device 104 may also generate the speech data, or the speech data may be stored in local memory of the computing device 104.

After obtaining the speech data 102, the computing device 104 may process the speech data 102 to obtain features 106 of the speech data. In one example, the computing device may run a trained wav2vec2 model to extract a contextual representation of the speech data as the features 106. In another example, the computing device runs another speech representation model to obtain the features of the speech data; any suitable speech representation model that converts speech information into a vector representation can be used. The above examples merely describe the present disclosure and do not specifically limit it; those skilled in the art may use any suitable model to obtain the features of the speech data.

The computing device 104 may also obtain, based on the features 106, a first quality level and a second quality level for the speech data, where the first quality level is related to a first language and the second quality level is related to a second language. In one example, the first language is English and the second language is Chinese. In another example, the first language is Chinese and the second language is English. The above examples merely describe the present disclosure and do not specifically limit it; those skilled in the art may set the first and second languages as needed. In some embodiments, the first quality level is generated using an encoder related to the first language, and the second quality level is generated using an encoder related to the second language.

FIG. 1 shows the computing device 104 obtaining the quality level 108 and the quality level 110; this is merely an example and not a specific limitation of the present disclosure. The computing device 104 may also obtain any number of quality levels for any number of languages as needed.

The computing device 104 determines, from the first quality level 108 and the second quality level 110, a target quality level for the speech data 102 to indicate the speech quality of the speech data 102. In examples where the computing device 104 generates more than two quality levels, the computing device 104 selects the target quality level from among the multiple quality levels. Alternatively or additionally, the computing device 104 uses the language of the speech data to select the quality level corresponding to that language as the target quality level.

In one example, the target quality level is a score. In another example, the target quality level is one of several grades, such as good, medium, and poor, or A, B, C, and D. The above examples merely describe the present disclosure and do not specifically limit it; those skilled in the art may set the content of the target quality level as needed.
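A grade-style target quality level could, for instance, be obtained by bucketing a numeric MOS-style score; the thresholds below are illustrative assumptions, not values from the disclosure:

```python
# One possible bucketing of a MOS-style score (1..5) into the letter
# grades mentioned in the text. The thresholds are illustrative only.

def to_grade(mos_score):
    if mos_score >= 4.0:
        return "A"
    if mos_score >= 3.0:
        return "B"
    if mos_score >= 2.0:
        return "C"
    return "D"

print(to_grade(4.2), to_grade(3.1), to_grade(1.5))  # A B D
```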

This method improves the accuracy and efficiency of evaluating the quality of speech data across domains, increases data utilization, and improves the user experience.

A block diagram of an example environment 100 in which embodiments of the present disclosure can be implemented has been described above in conjunction with FIG. 1. A flowchart of a process 200 for determining the quality of speech data according to an embodiment of the present disclosure is described below in conjunction with FIG. 2. The process 200 may be performed at the computing device 104 of FIG. 1.

At block 202, features of acquired speech data are determined. For example, the computing device 104 acquires speech data and then processes it to determine features of the acquired speech data.

In some embodiments, the speech data acquired by the computing device 104 may be speech data generated by speech synthesis, for example, speech data generated by converting text to speech. In some embodiments, the speech data acquired by the computing device 104 is generated by voice conversion, for example, converting a speaker speaking Chinese into speech data of the speaker speaking English. The above examples merely describe the present disclosure and do not specifically limit it; any suitable speech data may be used to determine speech quality. In this way, the quality of synthesized or converted speech can be determined quickly.

In some embodiments, the features of the speech data are vectors obtained for the speech data. In some embodiments, the features of the speech data are a contextual representation of the speech data. In some embodiments, the computing device 104 obtains a fine-tuned speech representation model and then obtains the features by applying the fine-tuned speech representation model to the speech data. In some embodiments, the computing device 104 applies an adjusted speech representation model to the speech data to obtain the features; the adjusted speech representation model is obtained by adjusting the fine-tuned speech representation model while training the speech data evaluation model as a whole using sample speech data and corresponding quality levels. The above examples merely describe the present disclosure and do not specifically limit it; those skilled in the art may use any suitable model to obtain the features of the speech data. Alternatively or additionally, the contextual representation is related to the text corresponding to the speech data. The training of the models is described further below in conjunction with FIGS. 4 and 5.

At block 204, a first quality level and a second quality level for the speech data are obtained based on the features, where the first quality level is related to a first language and the second quality level is related to a second language. For example, the computing device 104 processes the features to obtain the first quality level 108 and the second quality level 110 for the speech data 102.

In some embodiments, the computing device 104 obtains the first quality level by applying a trained first decoder to the features, and obtains the second quality level by applying a trained second decoder to the features. The trained first decoder and the trained second decoder can process the features of the speech data to obtain corresponding quality levels. The first decoder is trained using speech data related to the first language, and the second decoder is trained using speech data related to the second language. In this way, different decoders can be used to determine quality levels from the speech data, enabling quality detection of speech data in various languages.
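Both decoders consume the same feature vector, so the arrangement can be sketched as two language-specific heads over shared features. The linear heads and their weights below are illustrative stand-ins for the trained decoders, not the disclosure's actual parameters:

```python
# Sketch of two language-specific decoder heads sharing one feature
# vector. The weights and biases are illustrative values only, not
# trained decoder parameters.

def linear_head(features, weights, bias):
    # A trivially simple "decoder": an affine map from features to
    # a scalar quality level.
    return bias + sum(w * f for w, f in zip(weights, features))

features = [0.4, -0.1, 0.7]           # one shared feature vector
head_lang1 = ([0.3, 0.1, 0.2], 3.0)   # (weights, bias), first language
head_lang2 = ([0.1, 0.4, 0.1], 3.1)   # (weights, bias), second language

quality_1 = linear_head(features, *head_lang1)
quality_2 = linear_head(features, *head_lang2)
print(round(quality_1, 2), round(quality_2, 2))
```

In the actual scheme each head would be a trained neural decoder rather than a fixed affine map, but the data flow (one feature vector in, one quality level out per language) is the same.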

Alternatively or additionally, when obtaining a quality level using the first decoder or the second decoder, an identification (ID) of the provider of the quality evaluation also needs to be input, so as to obtain the quality level given by that provider. During inference with the first decoder or the second decoder, this provider is a virtual scorer. The training process of the decoders is described further below in conjunction with FIGS. 4 and 5.

Alternatively or additionally, the computing device 104 also contains decoders for other languages. For example, based on the features, the computing device may also obtain a third quality level for the speech data, the third quality level being related to a third language.

At block 206, a target quality level for the speech data is determined based on the first quality level and the second quality level, the target quality level indicating the speech quality of the speech data. The computing device 104 uses the first quality level and the second quality level to determine the target quality level.

In some embodiments, the computing device 104 selects, from the first quality level and the second quality level, the quality level corresponding to the language of the speech data as the target quality level.

In some embodiments, the computing device 104 also computes a third quality level. In that case, determining the target quality level for the speech data further includes determining the target quality level based on the first quality level, the second quality level, and the third quality level.

This method improves the accuracy and efficiency of evaluating the quality of speech data across domains, increases data utilization, and improves the user experience.

A flowchart of the process 200 for determining the quality of speech data according to an embodiment of the present disclosure has been described above in conjunction with FIG. 2. A specific example 300 of determining the speech quality of speech data is described below in conjunction with FIG. 3. In the example depicted in FIG. 3, the speech quality of the data is represented by a score, and the speech representation model is a wav2vec2 model.

As shown in FIG. 3, synthesized speech data 302 is input to a trained wav2vec2 model 304. In one example, the trained wav2vec2 model 304 is obtained by fine-tuning a pre-trained wav2vec2 model with a set of speech data and corresponding text. In another example, after the pre-trained wav2vec2 model is fine-tuned, it is further adjusted using the sample speech data and sample quality levels when the decoders are subsequently trained.

The pre-trained wav2vec2 model is obtained through self-supervised learning on a large amount of speech data. The pre-trained wav2vec2 consists of three parts: an encoder composed of convolutional neural networks, a context processor composed of Transformers, and a quantizer. Its input is the raw speech signal. The encoder encodes each 25 ms segment of a 16 kHz speech signal, taken every 20 ms, into a hidden vector; the context processor then incorporates information from other segments of the whole utterance into the current segment, further processing the hidden vector into a context-dependent segment representation. The quantizer is used only during wav2vec2 pre-training. The pre-trained wav2vec2 model may be supplied by a third-party provider or trained by the user on their own speech data.
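The 25 ms window and 20 ms hop described above can be made concrete with a small sketch. This is purely illustrative arithmetic, not the patented encoder itself; the real encoder produces one hidden vector per window rather than the raw frame.

```python
# Illustrative sketch (not the patented implementation): segmenting a
# 16 kHz waveform into 25 ms windows taken every 20 ms, one window per
# hidden vector produced by the convolutional encoder described above.
SAMPLE_RATE = 16_000              # samples per second
WIN = int(0.025 * SAMPLE_RATE)    # 25 ms window -> 400 samples
HOP = int(0.020 * SAMPLE_RATE)    # 20 ms hop    -> 320 samples

def segment(waveform):
    """Split a sequence of samples into overlapping 25 ms frames."""
    frames = []
    start = 0
    while start + WIN <= len(waveform):
        frames.append(waveform[start:start + WIN])
        start += HOP
    return frames

one_second = [0.0] * SAMPLE_RATE
frames = segment(one_second)
print(len(frames), len(frames[0]))  # 49 frames of 400 samples each
```

So one second of 16 kHz audio yields 49 windows, each later encoded into one hidden vector.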

When fine-tuning the pre-trained wav2vec2 model, the parameters of the convolutional encoder and the Transformer context processor are retained, and a fully connected layer with freshly initialized parameters is appended. During fine-tuning, the pre-trained wav2vec2 is trained on sample speech data and corresponding text, and its parameters are adjusted using the connectionist temporal classification (CTC) loss function. After being fine-tuned in this speech-recognition fashion, the fine-tuned wav2vec2 is able to recognize pronunciation, and the context representations it produces are related to the textual content of the speech data. When the fine-tuned wav2vec2 is used for inference in this solution, the fully connected layer added during training is removed.
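The add-a-head-then-drop-it arrangement described above can be sketched as follows. All names and dimensions here are hypothetical stand-ins; the real backbone is the pretrained CNN encoder plus Transformer, and the real head is trained with the CTC loss.

```python
import numpy as np

# Hypothetical sketch of the fine-tuning arrangement described above:
# a backbone (standing in for the pretrained encoder + context processor)
# with a temporary fully connected (FC) layer on top for CTC fine-tuning.
# At inference the FC layer is dropped, so the model emits context
# representations rather than per-frame token logits.
rng = np.random.default_rng(0)

class Wav2Vec2WithHead:
    def __init__(self, hidden_dim=8, vocab_size=5):
        # stand-in for the retained pretrained parameters
        self.backbone_w = rng.standard_normal((hidden_dim, hidden_dim))
        # freshly initialized FC layer, used only during fine-tuning
        self.head_w = rng.standard_normal((hidden_dim, vocab_size))

    def context(self, frames):
        # (T, hidden_dim) context representations: the inference output
        return frames @ self.backbone_w

    def logits(self, frames):
        # per-frame token logits that the CTC loss would consume
        return self.context(frames) @ self.head_w

model = Wav2Vec2WithHead()
x = rng.standard_normal((10, 8))   # 10 frames of 8-dim features
print(model.logits(x).shape)       # (10, 5) during fine-tuning
print(model.context(x).shape)      # (10, 8) at inference, head removed
```

Dropping the head at inference is simply a matter of calling `context` instead of `logits`; no parameters of the backbone change.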

The trained wav2vec2 model 304 processes the synthesized speech data 302 to obtain a context vector 306. The context vector 306 is then input to decoder A 310 and decoder B 314, respectively. Decoder A 310 is used to score English speech data, and decoder B 314 is used to score Chinese speech data. When scoring, decoder A 310 also takes an English virtual rater ID 308 as input, and decoder B 314 also takes a Chinese virtual rater ID 312 as input. Decoder A and decoder B then each produce the score of the corresponding virtual rater. In the score selection module 316, the score corresponding to the language of the synthesized speech data 302 is then selected as the score 318 for evaluating the synthesized speech data 302. For example, if the synthesized speech data is in English, the score from decoder A 310 is selected; if it is in Chinese, the score from decoder B 314 is selected. The above example is intended only to describe the present disclosure, not to specifically limit it. The number of decoders and their corresponding languages may be set by the user according to actual needs. For example, multiple decoders may be trained with speech data in multiple languages when training the decoders.

This method improves the accuracy and efficiency of evaluating the quality of speech data across domains, increases data utilization, and improves the user experience.

The inference process using the speech representation model and the decoders has been described above. A schematic diagram of a process 400 for training the decoders and the speech representation model according to an embodiment of the present disclosure is described below in conjunction with FIG. 4. The process 400 may be performed by the computing device in FIG. 1 or by another suitable computing device.

The training process uses sample speech data and sample scores to obtain an adjusted speech representation model, a trained first decoder, and a trained second decoder.

At block 402, the computing device 104 first obtains the fine-tuned speech representation model, the first decoder, and the second decoder. In one example, the fine-tuned speech representation model, the first decoder, and the second decoder are received from another device. In another example, they are obtained from the computing device 104 itself.

At block 404, the computing device 104 obtains a first set of sample speech data and corresponding sample quality levels. The computing device obtains the sample quality levels corresponding to this set of sample data. For example, each piece of sample speech data has four sample quality levels: three are provided by real providers, while the fourth is provided by a virtual provider and equals the average of the other three. The above example is intended only to describe the present disclosure, not to specifically limit it. Those skilled in the art may set the number and composition of the quality levels as needed.
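The virtual rater's label defined above is just the mean of the real raters' labels, which a one-line sketch makes explicit (the function name is hypothetical):

```python
# Sketch of the labeling scheme described above: each sample carries the
# quality levels of the real raters plus one virtual rater whose label
# is defined as the mean of the real raters' labels.
def with_virtual_rater(real_scores):
    virtual = sum(real_scores) / len(real_scores)
    return real_scores + [virtual]

labels = with_virtual_rater([4.0, 5.0, 3.0])
print(labels)  # [4.0, 5.0, 3.0, 4.0]
```

At inference time, only the virtual rater ID is fed to the decoder, so the predicted score approximates this averaged judgment.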

At block 406, the computing device 104 then determines a first set of identifiers for the providers who supply the sample quality levels. These identifiers include one for a virtual provider. For example, there may be four provider identifiers: three for real providers and one for a virtual provider.

At block 408, the computing device 104 jointly trains the fine-tuned speech representation model and the first decoder or the second decoder using the first set of sample speech data, the corresponding sample quality levels, and the first set of identifiers, thereby obtaining the adjusted speech representation model and the trained first decoder or the trained second decoder.

During this training process, the computing device 104 takes a piece of sample speech data from the set of sample speech data and obtains the corresponding features through the fine-tuned speech representation model. The features are then input to the decoder corresponding to the language of that sample speech data. In addition, the decoder also takes as input the identifiers of the providers of the sample quality levels for that speech data. For example, if there are four providers, the decoder outputs four scores, and the parameters of the fine-tuned speech representation model and the corresponding decoder are then adjusted based on the four output scores and the scores given by the four providers.
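The comparison between the decoder's four outputs and the four providers' scores can be sketched as a per-rater loss. The patent does not fix a particular loss function, so the mean squared error below is an assumption used purely for illustration:

```python
# Hypothetical sketch of the training signal described above: the decoder
# predicts one score per rater ID, and a squared error against the raters'
# ground-truth scores would drive the parameter updates. MSE is assumed
# here for illustration; the source does not name a specific loss.
def per_rater_mse(predicted, target):
    assert len(predicted) == len(target)
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

pred = [3.5, 4.5, 3.0, 4.0]    # decoder outputs for the four rater IDs
gold = [4.0, 5.0, 3.0, 4.0]    # three real raters + virtual-rater average
print(per_rater_mse(pred, gold))  # 0.125
```

Backpropagating such a loss through both the decoder and the fine-tuned speech representation model is what makes the training at block 408 "joint".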

Joint training of the fine-tuned speech representation model and the decoders has been described above. In some embodiments, the first decoder and the second decoder may instead be trained on their own, without being trained together with the fine-tuned speech representation model. In that case, the fine-tuned speech representation model is treated as the fully trained speech representation model. The computing device 104 uses the fine-tuned speech representation model to determine sample features corresponding to a second set of sample speech data. The computing device 104 then determines a second set of identifiers for the providers who supply the sample quality levels. The computing device 104 then trains the first decoder or the second decoder using the sample features, the second set of identifiers, and the sample quality levels, obtaining the trained first decoder or the trained second decoder.

In some embodiments, the fine-tuned speech representation model may be trained by the computing device 104 or by another computing device. To obtain the fine-tuned speech representation model, a pre-trained speech representation model is first obtained. A third set of sample speech data and corresponding sample text are then acquired. Next, the fine-tuned speech representation model is generated by training the pre-trained speech representation model with the third set of sample speech data and the corresponding sample text. In this way, a fine-tuned speech representation model related to the text of the speech data is obtained, improving the accuracy of the evaluation.

The pre-trained speech representation model may be trained by the computing device 104 or by another computing device. To generate the pre-trained speech representation model, a fourth set of sample speech data is first acquired. The pre-trained speech representation model is then generated by training the speech representation model with the fourth set of sample speech data, where the speech representation model is a neural network model that converts speech signals into corresponding context representations.

Alternatively, although the above describes training the first decoder and the second decoder on the computing device 104, the first decoder and the second decoder may also be trained on other computing devices.

In some embodiments, the first decoder is a decoder for Chinese, and the second decoder is a decoder for English.

An example training process has been described above in conjunction with FIG. 4; an example 500 of training the models is described below in conjunction with FIG. 5.

As shown in FIG. 5, the speech quality evaluation model being trained consists of three parts: the fine-tuned wav2vec2, decoder A 512, and decoder B 514. At this point the fine-tuned wav2vec2 has had its last fully connected layer removed. During training, English synthesized audio data 502 or Chinese synthesized audio data 504 is input to the fine-tuned wav2vec2 model 506, which generates a corresponding context vector 508; the context vector 508 is input to decoder A 512 or decoder B 514. If the context vector 508 comes from English synthesized audio data, it is passed to decoder A 512. Decoder A 512 also takes as input a set of rater IDs 510 for the English synthesized data, one of which is a virtual rater ID whose corresponding score is the average of the scores of the other raters in the set. Scores 518 corresponding to the different raters are then obtained. The obtained scores are compared with the sample scores for that speech data to adjust the parameters of the fine-tuned wav2vec2 and decoder A.

If the context vector 508 comes from Chinese synthesized audio data, it is passed to decoder B 514. Decoder B 514 also takes as input a set of rater IDs 516 for the Chinese synthesized data, one of which is a virtual rater ID whose corresponding score is the average of the scores of the other raters in the set. Scores 520 corresponding to the different raters are then obtained. The obtained scores are compared with the sample scores for that speech data to adjust the parameters of the fine-tuned wav2vec2 model 506 and decoder B 514.

FIG. 5 describes adjusting the fine-tuned wav2vec2, decoder A 512, and decoder B 514 simultaneously using sample speech data and sample scores. In some embodiments, the parameters of the fine-tuned wav2vec2 may be left unchanged: during training, the fine-tuned wav2vec2 generates context vectors for the sample speech data, and the parameters of decoder A and decoder B are then adjusted against the sample scores, thereby training decoders A and B.

FIG. 6 shows a schematic block diagram of an apparatus for determining the quality of speech data according to an embodiment of the present disclosure. As shown in FIG. 6, the apparatus 600 includes a feature determination module 602 configured to determine features of acquired speech data; a quality level acquisition module 604 configured to acquire, based on the features, a first quality level and a second quality level for the speech data, the first quality level being associated with a first language and the second quality level with a second language; and a target quality level determination module 606 configured to determine, based on the first quality level and the second quality level, a target quality level for the speech data, the target quality level indicating the speech quality of the speech data.

In some embodiments, the quality level acquisition module 604 includes: a first quality level acquisition module configured to acquire the first quality level by applying the trained first decoder to the features; and a second quality level acquisition module configured to acquire the second quality level by applying the trained second decoder to the features.

In some embodiments, the feature determination module includes: a first feature acquisition module configured to apply the adjusted speech representation model to the speech data to obtain the features.

In some embodiments, a joint training module is further included, configured to obtain the adjusted speech representation model, the trained first decoder, and the trained second decoder through the following modules: a model and decoder acquisition module configured to acquire the fine-tuned speech representation model, the first decoder, and the second decoder; a sample quality level acquisition module configured to acquire the first set of sample speech data and the corresponding sample quality levels; a first identifier set determination module configured to determine the first set of identifiers for the providers who supply the sample quality levels; and a training acquisition module configured to jointly train the fine-tuned speech representation model and the first decoder or the second decoder using the first set of sample speech data, the corresponding sample quality levels, and the first set of identifiers, to obtain the adjusted speech representation model and the trained first decoder or the trained second decoder.

In some embodiments, the feature determination module includes: a fine-tuned speech representation model acquisition module configured to acquire the fine-tuned speech representation model; and a second feature acquisition module configured to obtain the features by applying the fine-tuned speech representation model to the speech data.

In some embodiments, a decoder acquisition module is further included, configured to obtain the trained first decoder and the trained second decoder through the following modules: a sample feature determination module configured to determine sample features corresponding to the second set of sample speech data; a second identifier set determination module configured to determine the second set of identifiers for the providers who supply the sample quality levels; and a decoder training module configured to train the first decoder or the second decoder using the sample features, the second set of identifiers, and the sample quality levels, to obtain the trained first decoder or the trained second decoder.

In some embodiments, acquiring the fine-tuned speech representation model involves: a pre-trained model acquisition module configured to acquire a pre-trained speech representation model; a corresponding sample text acquisition module configured to acquire the third set of sample speech data and the corresponding sample text; and a first generation module configured to generate the fine-tuned speech representation model by training the pre-trained speech representation model with the third set of sample speech data and the corresponding sample text.

In some embodiments, the pre-trained model acquisition module includes: a fourth sample speech data acquisition module configured to acquire the fourth set of sample speech data; and a pre-trained speech representation model generation module configured to generate the pre-trained speech representation model by training the speech representation model with the fourth set of sample speech data.

In some embodiments, the speech representation model is a neural network model that converts speech signals into corresponding features.

In some embodiments, the features are related to the text corresponding to the speech data.

In some embodiments, the apparatus 600 further includes: a third quality level acquisition module configured to acquire, based on the features, a third quality level for the speech data, the third quality level being associated with a third language; and the target quality level determination module further includes: a three-quality-level-based determination module configured to determine the target quality level based on the first quality level, the second quality level, and the third quality level.

In some embodiments, the target quality level determination module includes: a selection module configured to select, from the first quality level and the second quality level, the quality level corresponding to the language of the speech data as the target quality level.

In some embodiments, the apparatus 600 further includes: a speech data generation module configured to generate the speech data through speech synthesis or voice conversion.

FIG. 7 shows a schematic block diagram of an example device 700 that may be used to implement embodiments of the present disclosure. The computing device 104 in FIG. 1 may be implemented with the device 700. As shown, the device 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processes according to computer program instructions stored in a read-only memory (ROM) 702 or loaded from a storage unit 708 into a random access memory (RAM) 703. Various programs and data required for the operation of the device 700 may also be stored in the RAM 703. The CPU 701, ROM 702, and RAM 703 are connected to one another via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

Multiple components in the device 700 are connected to the I/O interface 705, including: an input unit 706, such as a keyboard or mouse; an output unit 707, such as various types of displays and speakers; a storage unit 708, such as a magnetic disk or optical disc; and a communication unit 709, such as a network card, modem, or wireless communication transceiver. The communication unit 709 allows the device 700 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.

The various procedures and processes described above, such as the methods 200 and 400, may be performed by the processing unit 701. For example, in some embodiments, the methods 200 and 400 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the CPU 701, one or more actions of the methods 200 and 400 described above may be performed.

The present disclosure may be a method, apparatus, system, and/or computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for carrying out various aspects of the present disclosure.

A computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. As used herein, a computer-readable storage medium is not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to respective computing/processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.

The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In scenarios involving a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), may be personalized using state information of the computer-readable program instructions, and the electronic circuit may execute the computer-readable program instructions to implement various aspects of the present disclosure.

Aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions that implement aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.

The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device, causing a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable apparatus, or other device implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a special-purpose hardware-based system that performs the specified functions or acts, or by a combination of special-purpose hardware and computer instructions.

The embodiments of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or technical improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (16)

1. A method of determining the quality of speech data, comprising:
determining a feature of the acquired voice data;
based on the feature, obtaining a first quality level and a second quality level for the voice data, the first quality level being related to a first language and the second quality level being related to a second language; and
determining a target quality level for the voice data based on the first quality level and the second quality level, the target quality level indicating a voice quality of the voice data.
2. The method of claim 1, wherein obtaining the first quality level and the second quality level comprises:
obtaining the first quality level by applying a trained first decoder to the feature; and
obtaining the second quality level by applying a trained second decoder to the feature.
3. The method of claim 2, wherein determining the feature of the acquired voice data comprises:
applying an adapted speech representation model to the voice data to obtain the feature.
4. The method of claim 3, further comprising:
obtaining a fine-tuned speech representation model, a first decoder, and a second decoder;
acquiring a first set of sample speech data and corresponding sample quality levels;
determining a first set of identifiers of providers that provided the sample quality levels;
obtaining the adjusted speech representation model and the trained first decoder or the trained second decoder by training the fine-tuned speech representation model and the first decoder or the second decoder using the first set of sample speech data, the corresponding sample quality levels, and the first set of identifiers.
5. The method of claim 2, wherein determining the feature of the acquired speech data comprises:
acquiring a fine-tuned speech representation model;
obtaining the feature by applying the fine-tuned speech representation model to the speech data.
6. The method of claim 5, further comprising:
determining sample features corresponding to a second set of sample speech data;
determining a second set of identifiers of providers that provided sample quality levels for the second set of sample speech data;
training the first decoder or the second decoder using the sample features, the second set of identifiers, and the sample quality levels to obtain the trained first decoder or the trained second decoder.
7. The method of claim 4 or 5, wherein obtaining the fine-tuned speech representation model comprises:
acquiring a pre-trained speech representation model;
acquiring a third set of sample speech data and corresponding sample text; and
generating the fine-tuned speech representation model by training the pre-trained speech representation model using the third set of sample speech data and the corresponding sample text.
8. The method of claim 7, wherein obtaining the pre-trained speech representation model comprises:
acquiring a fourth set of sample speech data; and
generating the pre-trained speech representation model by training a speech representation model using the fourth set of sample speech data.
9. The method of claim 8, wherein the speech representation model is a neural network model that converts speech signals into corresponding features.
10. The method of claim 1, wherein the feature relates to text corresponding to the speech data.
11. The method of claim 1, further comprising:
based on the feature, obtaining a third quality level for the speech data, the third quality level being related to a third language; and
wherein determining the target quality level for the voice data further comprises:
determining the target quality level based on the first quality level, the second quality level, and the third quality level.
12. The method of claim 1, wherein determining the target quality level comprises:
selecting, from the first quality level and the second quality level, the quality level corresponding to the language of the voice data as the target quality level.
13. The method of claim 1, further comprising:
generating the voice data by voice synthesis or voice conversion.
14. An apparatus for determining quality of speech data, comprising:
a feature determination module configured to determine a feature of the acquired voice data;
a quality level obtaining module configured to obtain a first quality level and a second quality level for the voice data based on the feature, the first quality level being related to a first language, the second quality level being related to a second language; and
a target quality level determination module configured to determine a target quality level for the voice data based on the first quality level and the second quality level, the target quality level indicating a voice quality of the voice data.
15. An electronic device, comprising:
at least one processor; and
storage means for storing at least one program which, when executed by the at least one processor, causes the at least one processor to carry out the method according to any one of claims 1-13.
16. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-13.
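To make the claimed flow concrete, the following minimal Python sketch illustrates the steps of claims 1, 2, and 12: determining a feature of the speech data, obtaining one quality level per language-specific decoder, and selecting the level corresponding to the speech's language as the target quality level. All names here (`determine_target_quality`, `extract_features`, the dummy decoders) are hypothetical stand-ins for illustration only, not the patent's actual models or implementation.

```python
from typing import Callable, Dict, Sequence

def determine_target_quality(
    speech_data: Sequence[float],
    extract_features: Callable[[Sequence[float]], Sequence[float]],
    decoders: Dict[str, Callable[[Sequence[float]], float]],
    language: str,
) -> float:
    # Step 1 (claim 1): determine a feature of the acquired speech data,
    # e.g. via a fine-tuned speech representation model (claim 5).
    feature = extract_features(speech_data)
    # Step 2 (claim 2): apply each trained, language-specific decoder to
    # the feature to obtain one quality level per language.
    quality_levels = {lang: decode(feature) for lang, decode in decoders.items()}
    # Step 3 (claim 12): select the quality level corresponding to the
    # language of the speech data as the target quality level.
    return quality_levels[language]

if __name__ == "__main__":
    # Toy stand-ins: identity "feature extractor" and two dummy decoders.
    decoders = {
        "en": lambda f: sum(f) / len(f),  # dummy "first decoder"
        "zh": lambda f: max(f),           # dummy "second decoder"
    }
    score = determine_target_quality(
        [0.2, 0.4, 0.6], lambda x: x, decoders, language="en"
    )
    print(round(score, 2))  # 0.4
```

In the patent's terms, each decoder plays the role of the trained first or second decoder, and `extract_features` plays the role of the (fine-tuned) speech representation model; claim 11 extends the same pattern to a third decoder for a third language.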
CN202210939917.2A 2022-08-05 2022-08-05 Method, apparatus, device and storage medium for determining the quality of voice data Active CN115312078B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210939917.2A CN115312078B (en) 2022-08-05 2022-08-05 Method, apparatus, device and storage medium for determining the quality of voice data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210939917.2A CN115312078B (en) 2022-08-05 2022-08-05 Method, apparatus, device and storage medium for determining the quality of voice data

Publications (2)

Publication Number Publication Date
CN115312078A true CN115312078A (en) 2022-11-08
CN115312078B CN115312078B (en) 2025-02-18

Family

ID=83861583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210939917.2A Active CN115312078B (en) 2022-08-05 2022-08-05 Method, apparatus, device and storage medium for determining the quality of voice data

Country Status (1)

Country Link
CN (1) CN115312078B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116230018A (en) * 2023-02-22 2023-06-06 上海交通大学 A Synthesized Speech Quality Evaluation Method for Speech Synthesis System

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040039570A1 (en) * 2000-11-28 2004-02-26 Steffen Harengel Method and system for multilingual voice recognition
CN105931636A (en) * 2015-11-30 2016-09-07 中华电信股份有限公司 Multi-language system voice recognition device and method thereof
CN107221318A (en) * 2017-05-12 2017-09-29 广东外语外贸大学 Oral English Practice pronunciation methods of marking and system
CN109410915A (en) * 2017-08-15 2019-03-01 中国移动通信集团终端有限公司 The appraisal procedure and device of voice quality, computer readable storage medium
KR20190050659A (en) * 2017-11-03 2019-05-13 주식회사 비즈모델라인 Method for Evaluating Response of Heterogeneous Language Speaking
CN110415725A (en) * 2019-07-15 2019-11-05 北京语言大学 Use the method and system of first language data assessment second language pronunciation quality
CN111179916A (en) * 2019-12-31 2020-05-19 广州市百果园信息技术有限公司 Re-scoring model training method, voice recognition method and related device
CN112562736A (en) * 2020-12-11 2021-03-26 中国信息通信研究院 Voice data set quality evaluation method and device
US20210217403A1 (en) * 2019-05-15 2021-07-15 Lg Electronics Inc. Speech synthesizer for evaluating quality of synthesized speech using artificial intelligence and method of operating the same
CN114242044A (en) * 2022-02-25 2022-03-25 腾讯科技(深圳)有限公司 Voice quality evaluation method, voice quality evaluation model training method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MICHAEL CHINEN: "Marginal Effects of Language and Individual Raters on Speech Quality Models", IEEE ACCESS, 13 September 2021 (2021-09-13) *
ZHANG, Shuang; LIU, Jia: "Research on a Pronunciation Quality Evaluation Method Using Prosody Improvement in Language Learning Machines", Journal of Chinese Computer Systems, no. 05, 15 May 2009 (2009-05-15) *
LI, Linlin; LI, Na; ZHANG, Zhinan: "Automatic Detection Based on English Speech Stress", China Science and Technology Information, no. 11, 1 June 2013 (2013-06-01) *
QIU, Zeyu; QU, Dan; ZHANG, Lianhai: "An End-to-End Speech Synthesis Method Based on WaveNet", Journal of Computer Applications, no. 05, 21 January 2019 (2019-01-21) *

Also Published As

Publication number Publication date
CN115312078B (en) 2025-02-18

Similar Documents

Publication Publication Date Title
CN110457457B (en) Training method of dialogue generation model, dialogue generation method and device
US11093813B2 (en) Answer to question neural networks
CN111402861B (en) Voice recognition method, device, equipment and storage medium
CN112735373A (en) Speech synthesis method, apparatus, device and storage medium
CN108766414A (en) Method, apparatus, equipment and computer readable storage medium for voiced translation
WO2021227707A1 (en) Audio synthesis method and apparatus, computer readable medium, and electronic device
CN112185363B (en) Audio processing method and device
US9613616B2 (en) Synthesizing an aggregate voice
CN114783410B (en) Speech synthesis method, system, electronic device and storage medium
US20190206386A1 (en) Method and system for text-to-speech synthesis
CN109119067B (en) Speech synthesis method and device
CN110600013A (en) Training method and device for non-parallel corpus voice conversion data enhancement model
Feng et al. ASR-GLUE: A new multi-task benchmark for asr-robust natural language understanding
CN116561265A (en) Personalized dialog generation method, model training method and device
KR102663654B1 (en) Adaptive visual speech recognition
Cui et al. Evaluation system of mobile english learning platform by using deep learning algorithm
CN115171644A (en) Speech synthesis method, apparatus, electronic device and storage medium
CN115206342A (en) A data processing method, apparatus, computer equipment and readable storage medium
CN115312078A (en) Method, apparatus, device and storage medium for determining quality of voice data
CN116166793A (en) Training method of abstract generation model, abstract generation method and device
CN115346520A (en) Method, apparatus, electronic device and medium for speech recognition
Behre et al. Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition
CN111191451A (en) Chinese sentence simplification method and device
CN114783405B (en) Speech synthesis method, device, electronic equipment and storage medium
Liu Multimedia interactive system of vocal music teaching based on voice recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant