CN115312078A - Method, apparatus, device and storage medium for determining quality of voice data - Google Patents
- Publication number
- CN115312078A CN115312078A CN202210939917.2A CN202210939917A CN115312078A CN 115312078 A CN115312078 A CN 115312078A CN 202210939917 A CN202210939917 A CN 202210939917A CN 115312078 A CN115312078 A CN 115312078A
- Authority
- CN
- China
- Prior art keywords
- quality level
- speech
- quality
- decoder
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
Embodiments of the present disclosure relate to a method, apparatus, device, and storage medium for determining the quality of speech data. The method includes determining features of acquired speech data. The method further includes obtaining, based on the features, a first quality level and a second quality level for the speech data, the first quality level being associated with a first language and the second quality level with a second language. The method further includes determining, based on the first and second quality levels, a target quality level for the speech data that indicates its speech quality. This method improves the accuracy and efficiency of evaluating the quality of speech data across domains, increases data utilization, and improves the user experience.
Description
Technical Field
Embodiments of the present disclosure relate generally to the field of speech data processing, and in particular to a method, apparatus, device, and storage medium for determining the quality of speech data.
Background
With the development of computer technology, speech processing has improved rapidly. Speech data synthesized or converted by computing devices is used in a growing number of devices and applications. Such speech data can be evaluated by its speech quality, which is the main indicator of the performance of systems such as speech synthesis and voice conversion. The Mean Opinion Score (MOS) is a subjective rating of speech quality assigned by annotators after listening to synthesized audio. Because traditional MOS scoring requires the participation of many annotators, this subjective evaluation process is expensive and time-consuming. Many problems therefore remain to be solved in the processing of speech data.
Summary
Embodiments of the present disclosure provide a method, apparatus, device, and storage medium for determining the quality of speech data.
According to a first aspect of the present disclosure, a method for determining the quality of speech data is provided. The method includes determining features of acquired speech data. The method further includes obtaining, based on the features, a first quality level and a second quality level for the speech data, the first quality level being associated with a first language and the second quality level with a second language. The method further includes determining, based on the first and second quality levels, a target quality level for the speech data that indicates its speech quality.
In a second aspect of the present disclosure, an apparatus for determining the quality of speech data is provided. The apparatus includes a feature determination module configured to determine features of acquired speech data; a quality level acquisition module configured to obtain, based on the features, a first quality level and a second quality level for the speech data, the first quality level being associated with a first language and the second quality level with a second language; and a target quality level determination module configured to determine, based on the first and second quality levels, a target quality level for the speech data that indicates its speech quality.
In a third aspect of the present disclosure, an electronic device is provided, including at least one processor and a storage device storing at least one program which, when executed by the at least one processor, causes the at least one processor to implement the method according to the first aspect of the present disclosure.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the program implements the method according to the first aspect of the present disclosure.
It should be understood that this summary is not intended to identify key or essential features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments with reference to the accompanying drawings, in which the same reference numerals generally denote the same components.
FIG. 1 illustrates a schematic diagram of an example environment 100 in which devices and/or methods of embodiments of the present disclosure may be implemented;
FIG. 2 illustrates a flowchart of a process 200 for determining the quality of speech data according to an embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of an example 300 of generating a score for speech data according to an embodiment of the present disclosure;
FIG. 4 illustrates a schematic diagram of a process 400 for training a decoder and a speech representation model according to an embodiment of the present disclosure;
FIG. 5 illustrates a schematic diagram of an example 500 of training a decoder and a speech representation model according to an embodiment of the present disclosure;
FIG. 6 illustrates a schematic block diagram of an apparatus 600 for determining the quality of speech data according to an embodiment of the present disclosure;
FIG. 7 illustrates a schematic block diagram of an example device 700 suitable for implementing embodiments of the present disclosure.
Throughout the drawings, the same or corresponding reference numerals denote the same or corresponding parts.
Detailed Description
It should be understood that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed in an appropriate manner, in accordance with relevant laws and regulations, of the types, scope of use, and usage scenarios of the personal information involved in the present disclosure, and the user's authorization should be obtained.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly remind the user that the requested operation will require acquiring and using the user's personal information. Based on the prompt, the user can then independently choose whether to provide personal information to the software or hardware, such as an electronic device, application, server, or storage medium, that performs the operations of the technical solution of the present disclosure.
As an optional but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user, for example, in a pop-up window, in which the prompt may be presented as text. The pop-up window may also carry a selection control allowing the user to choose "agree" or "disagree" to providing personal information to the electronic device.
It should be understood that the above process of notifying the user and obtaining authorization is merely illustrative and does not limit the implementation of the present disclosure; other methods that satisfy relevant laws and regulations may also be applied to the implementation of the present disclosure.
It should be understood that the data involved in this technical solution (including but not limited to the data itself and its acquisition or use) should comply with the requirements of applicable laws, regulations, and related provisions.
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit its scope of protection.
In the description of the embodiments of the present disclosure, the term "comprising" and similar expressions should be interpreted as open-ended, that is, "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be read as "at least one embodiment". The terms "first", "second", and so on may refer to different or identical objects. Other explicit and implicit definitions may also be included below.
As noted above, having annotators score the speech quality of speech data is expensive and time-consuming. Automatic MOS scoring systems have therefore been proposed, which use machines to score synthesized audio in place of annotators' subjective evaluations, saving time and resources. However, machine MOS scoring faces at least two challenges. The first is data sparsity: there is not much data available for training a MOS scoring system, which limits its performance. The second is scoring cross-domain synthesized audio (synthesized audio in different languages): for synthesized audio in various languages, a MOS scoring system may be unable to give an accurate score due to a lack of training data in the corresponding language.
Some traditional schemes design automatic scoring systems that use a pre-trained encoder for automatic MOS scoring. Such schemes can achieve good scoring results on in-domain data (where the training and test sets share the same language).
However, such traditional schemes suffer from many problems in cross-domain tasks. For example, when training sets from different domains are used separately, cross-domain training data is very scarce, which limits the encoder's performance; when training sets from different domains are mixed, the system cannot adapt well to the synthesized audio in the test set because some training and test sets have different language backgrounds. This leads to inaccurate MOS scores.
In addition, these traditional schemes do not consider the connections between different tasks, such as the relationship between automatic speech recognition and automatic MOS scoring. MOS scoring data is sparse, but automatic speech recognition data is abundant.
To address at least the above and other potential problems, embodiments of the present disclosure propose a method for determining the quality of speech data. In this method, a computing device determines features of acquired speech data and then uses those features to obtain a first quality level and a second quality level for the speech data, each associated with a language. The computing device then determines a target quality level for the speech data based on the first and second quality levels. This method improves the accuracy and efficiency of evaluating the quality of speech data across domains, increases data utilization, and improves the user experience.
Embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings. FIG. 1 shows an example environment 100 in which devices and/or methods of embodiments of the present disclosure may be implemented.
The environment 100 includes a computing device 104, which processes speech data 102 to determine a target quality level 112 for the speech data 102.
Examples of the computing device 104 include, but are not limited to, personal computers, server computers, handheld or laptop devices, mobile devices (such as mobile phones, personal digital assistants (PDAs), and media players), multiprocessor systems, consumer electronics, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
The speech data 102 received by the computing device 104 is speech data that includes a speaker. It includes, but is not limited to, synthesized speech data or converted speech data. Synthesized speech data includes speech synthesized by a computing device to simulate a human speaking. Converted speech data includes converting one speaker's speech into another speaker's voice, or converting what a speaker says into another language. FIG. 1 shows the computing device 104 receiving the speech data 102; this is merely an example and not a limitation of the present disclosure. The computing device 104 may also generate the speech data, or the speech data may be stored in local storage of the computing device 104.
After obtaining the speech data 102, the computing device 104 can process it to obtain features 106 of the speech data. In one example, the computing device runs a trained wav2vec2 model to extract a contextual representation of the speech data as the features 106. In another example, the computing device runs another speech representation model to obtain the features; any suitable speech representation model that converts speech into a vector representation can be used to obtain features of speech data. These examples merely describe the present disclosure and do not limit it; those skilled in the art may use any suitable model to obtain features of speech data.
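The per-segment contextual representations produced by such a speech representation model are commonly pooled into a single utterance-level feature vector before scoring. The following minimal numpy sketch uses mean pooling; the pooling choice and the function name are illustrative assumptions, not something the disclosure specifies:

```python
import numpy as np

def utterance_feature(segment_reprs):
    """Mean-pool per-segment contextual representations of shape (T, D) into one (D,) vector."""
    segment_reprs = np.asarray(segment_reprs, dtype=float)
    return segment_reprs.mean(axis=0)

# Two segments with 2-dimensional contextual representations.
reprs = np.array([[1.0, 2.0], [3.0, 4.0]])
feat = utterance_feature(reprs)  # -> array([2.0, 3.0])
```

Any other aggregation (attention pooling, taking the final segment, etc.) could be substituted without changing the rest of the pipeline.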
Based on the features 106, the computing device 104 can also obtain a first quality level and a second quality level for the speech data, the first quality level being associated with a first language and the second with a second language. In one example, the first language is English and the second language is Chinese. In another example, the first language is Chinese and the second language is English. These examples merely describe the present disclosure; those skilled in the art may choose the first and second languages as needed. In some embodiments, the first quality level is generated with a decoder associated with the first language, and the second quality level with a decoder associated with the second language.
FIG. 1 shows the computing device 104 obtaining the quality level 108 and the quality level 110; this is merely an example and not a limitation of the present disclosure. The computing device 104 may also obtain any number of quality levels for any number of languages as needed.
The computing device 104 determines, from the first quality level 108 and the second quality level 110, a target quality level for the speech data 102 that indicates its speech quality. In examples where the computing device 104 generates more than two quality levels, it selects the target quality level from among the multiple quality levels. Alternatively or additionally, the computing device 104 uses the language of the speech data to select the quality level corresponding to that language as the target quality level.
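The language-based selection described above can be sketched in a few lines of Python; the function name and the language tags are hypothetical, not part of the disclosed embodiments:

```python
def select_target_quality_level(quality_levels, language):
    """Pick the quality level matching the language of the speech data.

    quality_levels: mapping from a language tag to the quality level produced
    by the decoder trained for that language (e.g. decoder A for English,
    decoder B for Chinese).
    """
    if language not in quality_levels:
        raise ValueError(f"no decoder output for language: {language}")
    return quality_levels[language]

# One score per per-language decoder, as in FIG. 1 / FIG. 3.
scores = {"en": 4.2, "zh": 3.1}
target = select_target_quality_level(scores, "zh")  # -> 3.1
```

With more than two decoders, the same dictionary simply gains more entries; the selection logic is unchanged.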
In one example, the target quality level is a score. In another example, the target quality level is a grade, such as good/medium/poor or A/B/C/D. These examples merely describe the present disclosure; those skilled in the art may set the content of the target quality level as needed.
This method improves the accuracy and efficiency of evaluating the quality of speech data across domains, increases data utilization, and improves the user experience.
A block diagram of the example environment 100 in which embodiments of the present disclosure can be implemented has been described above with reference to FIG. 1. A flowchart of a process 200 for determining the quality of speech data according to an embodiment of the present disclosure is described below with reference to FIG. 2. The process 200 may be performed at the computing device 104 of FIG. 1.
At block 202, features of acquired speech data are determined. For example, the computing device 104 acquires speech data and processes it to determine its features.
In some embodiments, the speech data acquired by the computing device 104 may be generated by speech synthesis, for example by converting text to speech. In some embodiments, the speech data is generated by voice conversion, for example converting a speaker's Chinese speech into English speech. These examples merely describe the present disclosure; any suitable speech data can be used for determining speech quality. In this way, the quality of synthesized or converted speech can be determined quickly.
In some embodiments, the features of the speech data are vectors obtained for the speech data. In some embodiments, the features are a contextual representation of the speech data. In some embodiments, the computing device 104 obtains a fine-tuned speech representation model and obtains the features by applying the fine-tuned speech representation model to the speech data. In some embodiments, the computing device 104 applies an adjusted speech representation model to the speech data to obtain the features; the adjusted speech representation model is obtained by adjusting the fine-tuned speech representation model while training the speech quality evaluation model as a whole with sample speech data and corresponding quality levels. These examples merely describe the present disclosure; those skilled in the art may use any suitable model to obtain features of speech data. Alternatively or additionally, the contextual representation is related to the text corresponding to the speech data. Training of the models is described further below with reference to FIGS. 4 and 5.
At block 204, a first quality level and a second quality level for the speech data are obtained based on the features, the first quality level being associated with a first language and the second with a second language. For example, the computing device 104 processes the features to obtain the first quality level 108 and the second quality level 110 for the speech data 102.
In some embodiments, the computing device 104 obtains the first quality level by applying a trained first decoder to the features, and the second quality level by applying a trained second decoder to the features. The trained first and second decoders process the features of the speech data to obtain the corresponding quality levels. The first decoder is trained with speech data in the first language, and the second decoder with speech data in the second language. In this way, different decoders can determine quality levels from the speech data, enabling quality assessment of speech data in various languages.
Alternatively or additionally, when obtaining a quality level with the first or second decoder, the identification (ID) of a provider of quality evaluations is also input, so as to obtain the quality level given by that provider. During inference with the first or second decoder, this provider is a virtual scorer. The decoder training process is described further below with reference to FIGS. 4 and 5.
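One plausible realization of such a scorer-conditioned decoder concatenates the utterance-level context vector with a learned embedding of the scorer ID before regressing a score. The layer sizes, names, and two-layer architecture below are illustrative assumptions, not the patent's specification:

```python
import numpy as np

rng = np.random.default_rng(0)

class ScorerConditionedDecoder:
    """Maps (context vector, scorer ID) to a scalar quality score."""

    def __init__(self, feat_dim, num_scorers, id_dim=8, hidden=32):
        # Embedding table: one learned vector per scorer ID.
        self.id_emb = rng.normal(size=(num_scorers, id_dim))
        self.w1 = rng.normal(size=(feat_dim + id_dim, hidden)) * 0.1
        self.w2 = rng.normal(size=(hidden, 1)) * 0.1

    def __call__(self, context_vec, scorer_id):
        # Condition the regression on which scorer's rating we want to imitate.
        x = np.concatenate([context_vec, self.id_emb[scorer_id]])
        h = np.tanh(x @ self.w1)
        return float(h @ self.w2)

decoder_a = ScorerConditionedDecoder(feat_dim=16, num_scorers=4)
ctx = rng.normal(size=16)      # stand-in for the wav2vec2 context vector
virtual_scorer_id = 3          # at inference, the virtual scorer's ID is used
score = decoder_a(ctx, virtual_scorer_id)
```

The weights here are random and untrained; the point is only the interface, namely that the same context vector can yield a different score for each scorer ID.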
Alternatively or additionally, the computing device 104 also has decoders for other languages. For example, based on the features, the computing device may also obtain a third quality level for the speech data, associated with a third language.
At block 206, a target quality level for the speech data is determined based on the first and second quality levels, the target quality level indicating the speech quality of the speech data. The computing device 104 uses the first and second quality levels to determine the target quality level.
In some embodiments, the computing device 104 selects, from the first and second quality levels, the quality level corresponding to the language of the speech data as the target quality level.
In some embodiments, the computing device 104 also computes a third quality level. In that case, determining the target quality level for the speech data further includes determining the target quality level based on the first, second, and third quality levels.
This method improves the accuracy and efficiency of evaluating the quality of speech data across domains, increases data utilization, and improves the user experience.
The flowchart of the process 200 for determining the quality of speech data according to an embodiment of the present disclosure has been described above with reference to FIG. 2. A specific example 300 of determining the speech quality of speech data is described below with reference to FIG. 3. In the example of FIG. 3, speech quality is represented by a score, and the speech representation model is a wav2vec2 model.
As shown in FIG. 3, synthesized speech data 302 is input to a trained wav2vec2 model 304. In one example, the trained wav2vec2 model 304 is obtained by fine-tuning a pre-trained wav2vec2 model with a set of speech data and corresponding text. In another example, after the pre-trained wav2vec2 model is fine-tuned, the trained wav2vec2 model 304 is further adjusted with sample speech data and sample quality levels when the decoders are later trained.
The pre-trained wav2vec2 model is obtained through self-supervised learning on a large amount of speech data. Pre-trained wav2vec2 consists of three parts: an encoder composed of convolutional neural networks, a context processor composed of Transformers, and a quantizer. Its input is the raw speech signal. The encoder encodes each 25 ms segment of a 16 kHz speech signal, with a 20 ms stride, into a latent vector; for each segment, the context processor also considers information from other segments of the whole utterance and further processes the latent vector into a context-dependent segment representation. The quantizer is used only during wav2vec2 pre-training. The pre-trained wav2vec2 model may be provided by another party or trained by the user on speech data.
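The framing described above (25 ms windows, 20 ms stride, 16 kHz sampling rate) implies a simple relationship between utterance length and the number of latent vectors the encoder emits. A rough back-of-the-envelope check, ignoring the padding details of the actual convolutional encoder:

```python
SAMPLE_RATE = 16_000   # Hz
WINDOW = 0.025         # 25 ms audio segment per latent vector
STRIDE = 0.020         # one latent vector every 20 ms

def num_latent_vectors(duration_s):
    """Approximate number of encoder outputs for an utterance of the given duration."""
    if duration_s < WINDOW:
        return 0
    return 1 + int((duration_s - WINDOW) / STRIDE)

n = num_latent_vectors(3.0)  # a 3-second utterance -> 149 latent vectors
```

That is, roughly 50 context-dependent segment representations per second of audio, which is the sequence the decoders consume (after any pooling).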
When fine-tuning the pre-trained wav2vec2 model, the parameters of the convolutional encoder and the Transformer context processor are retained, and a fully connected layer with initialized parameters is appended. During fine-tuning, the pre-trained wav2vec2 is trained with sample speech data and corresponding text, and the parameters of wav2vec2 are adjusted with a connectionist temporal classification (CTC) loss. After being fine-tuned in this speech-recognition manner, the fine-tuned wav2vec2 can recognize pronunciation, and the contextual representations it determines are related to the text content corresponding to the speech data. When used for inference in this scheme, the fine-tuned wav2vec2 removes the final fully connected layer added during training.
The trained wav2vec2 model 304 processes the synthesized speech data 302 to obtain a context vector 306, which is then input to decoder A 310 and decoder B 314. Decoder A 310 scores English speech data, and decoder B scores Chinese speech data. When scoring, decoder A 310 also takes an English virtual scorer ID 308 as input, and decoder B 314 a Chinese virtual scorer ID 312. Decoder A and decoder B then each produce the corresponding virtual scorer's score. A score selection module 316 then selects, according to the language of the synthesized speech data 302, the score corresponding to that language as the score 318 evaluating the synthesized speech data 302. For example, if the synthesized speech data is English, the score from decoder A 310 is selected; if it is Chinese, the score from decoder B 314 is selected. These examples merely describe the present disclosure; the number of decoders and their corresponding languages can be set by the user as needed. For example, multiple decoders can be trained with speech data in multiple languages.
This method improves the accuracy and efficiency of evaluating the quality of speech data across domains, increases data utilization, and improves the user experience.
The inference process using the speech representation model and the decoders has been described above. A schematic diagram of a process 400 for training the decoders and the speech representation model according to an embodiment of the present disclosure is described below with reference to FIG. 4. The process 400 may be performed by the computing device of FIG. 1 or another suitable computing device.
This training process uses sample speech data and sample scores to obtain the adjusted speech representation model, the trained first decoder, and the trained second decoder.
At block 402, the computing device 104 first obtains the fine-tuned speech representation model, the first decoder, and the second decoder. In one example, they are received from other devices. In another example, they are obtained from the computing device 104 itself.
At block 404, the computing device 104 obtains a first set of sample speech data and corresponding sample quality levels. For example, each sample speech datum has four sample quality levels: three provided by real providers, and a fourth provided by a virtual provider, where the fourth quality level is the average of the other three. These examples merely describe the present disclosure; those skilled in the art may set the number and composition of the quality levels as needed.
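Building the virtual provider's quality level as the average of the real providers' levels can be sketched as follows (a hypothetical helper, not the patent's own code):

```python
def with_virtual_provider(real_levels):
    """Append a virtual provider's quality level: the mean of the real providers' levels."""
    if not real_levels:
        raise ValueError("need at least one real provider level")
    virtual = sum(real_levels) / len(real_levels)
    return real_levels + [virtual]

# Three real annotators rated an utterance 4, 3, and 5;
# the virtual provider's level is their mean, 4.0.
levels = with_virtual_provider([4, 3, 5])
```

The virtual provider thus acts as a smoothed consensus target, and it is this provider's ID that is used at inference time.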
At block 406, the computing device 104 then determines a first set of identifiers for the providers that provide the sample quality levels. These include the identifier of one virtual provider. For example, there are four provider identifiers: three real provider identifiers and one virtual provider identifier.
At block 408, the computing device 104 obtains the adjusted speech representation model and the trained first or second decoder by jointly training the fine-tuned speech representation model and the first or second decoder with the first set of sample speech data, the corresponding sample quality levels, and the first set of identifiers.
In this training process, the computing device 104 takes a sample speech datum from the set of sample speech data and obtains the corresponding features through the fine-tuned speech representation model. The features are then input to the decoder corresponding to the language of that sample speech datum. In addition, the identifiers of the providers of the sample quality levels for that speech datum are input to the decoder. For example, with four providers the decoder outputs four scores, and the parameters of the fine-tuned speech representation model and the corresponding decoder are then adjusted based on the four output scores and the four providers' scores.
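A minimal sketch of the per-sample update described above, using a mean-squared-error objective over the four providers' scores. The linear toy decoder, the loss, and the learning rate are illustrative assumptions; the patent does not specify the loss function or optimizer:

```python
import numpy as np

rng = np.random.default_rng(42)

feat_dim, num_scorers = 16, 4
W = rng.normal(size=(feat_dim, num_scorers)) * 0.1   # toy linear "decoder": one output per scorer ID
features = rng.normal(size=feat_dim)                 # stand-in speech-representation features for one sample
targets = np.array([4.0, 3.0, 5.0, 4.0])             # three real scores + virtual average

def mse_step(W, x, y, lr=0.01):
    """One gradient-descent step on mean((x @ W - y)^2)."""
    pred = x @ W                                      # predicted scores for all scorer IDs at once
    grad = np.outer(x, pred - y) * (2 / len(y))       # dL/dW for the squared-error mean
    return W - lr * grad, float(np.mean((pred - y) ** 2))

W1, loss0 = mse_step(W, features, targets)
_, loss1 = mse_step(W1, features, targets)
```

In the actual scheme the gradient would also flow back into the fine-tuned speech representation model (joint training); here only the decoder weights are updated, which is enough to show the loss shrinking step over step.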
Joint training of the fine-tuned speech representation model and the decoders has been described above. In some embodiments, the first and second decoders can be trained separately, without training them together with the fine-tuned speech representation model; in that case the fine-tuned speech representation model is treated as the trained speech representation model. The computing device 104 uses the fine-tuned speech representation model to determine sample features corresponding to a second set of sample speech data. It then determines a second set of identifiers for the providers that provide the sample quality levels. The computing device 104 then trains the first or second decoder with the sample features, the second set of identifiers, and the sample quality levels to obtain the trained first or second decoder.
In some embodiments, the fine-tuned speech representation model may be trained by the computing device 104 or by another computing device. To obtain the fine-tuned speech representation model, a pre-trained speech representation model is first obtained, and then a third set of sample speech data and corresponding sample text. The fine-tuned speech representation model is then generated by training the pre-trained speech representation model with the third set of sample speech data and the corresponding sample text. In this way, a fine-tuned speech representation model related to the text of the speech data can be obtained, improving the accuracy of the evaluation.
The pre-trained speech representation model may be trained by the computing device 104 or by another computing device. To generate the pre-trained speech representation model, a fourth set of sample speech data is first obtained, and the speech representation model is trained with it. The speech representation model is a neural network model that converts a speech signal into a corresponding contextual representation.
Alternatively, while the above shows training the first and second decoders at the computing device 104, the first and second decoders may also be trained on other computing devices.
In some embodiments, the first decoder is a decoder for Chinese, and the second decoder is a decoder for English.
An example training process has been described above with reference to FIG. 4. An example 500 of training the models is described below with reference to FIG. 5.
As shown in FIG. 5, the speech quality evaluation model being trained includes three parts: the fine-tuned wav2vec2 506, decoder A 512, and decoder B 514. At this point, the fine-tuned wav2vec2 has its final fully connected layer removed. During training, English synthesized audio data 502 or Chinese synthesized audio data 504 is input to the fine-tuned wav2vec2 model 506, which generates a corresponding context vector 508 that is input to decoder A 512 or decoder B 514. If the context vector 508 comes from English synthesized audio data, it goes to decoder A 512. Decoder A 512 also takes as input a set of scorer IDs 510 for the English synthesized data, one of which is a virtual scorer ID; the virtual scorer's score is the average of the other scorers' scores in the set. Scores 518 corresponding to the different scorers are then obtained and compared with the sample scores for that speech data to adjust the parameters of the fine-tuned wav2vec2 and decoder A.
If the context vector 508 comes from Chinese synthesized audio data, it goes to decoder B 514, which also takes as input a set of scorer IDs 516 for the Chinese synthesized data, one of which is a virtual scorer ID; the virtual scorer's score is the average of the other scorers' scores in the set. Scores 520 corresponding to the different scorers are then obtained and compared with the sample scores for that speech data to adjust the parameters of the fine-tuned wav2vec2 model 506 and decoder B 514.
FIG. 5 depicts adjusting the fine-tuned wav2vec2 506, decoder A 512, and decoder B 514 simultaneously with sample speech data and sample scores. In some embodiments, the parameters of the fine-tuned wav2vec2 may be left unchanged: during training, the fine-tuned wav2vec2 generates context vectors for the sample speech data, and the parameters of decoder A and decoder B are then adjusted against the sample scores to train decoders A and B.
图6示出了根据本公开实施例的用于确定文本中说话者的装置的示意性框图。如图6所示,装置600包括特征确定模块602,被配置为确定所获取的语音数据的特征;质量等级获取模块604,被配置为基于特征,获取针对语音数据的第一质量等级和第二质量等级,第一质量等级与第一语言有关,第二质量等级与第二语言有关;以及目标质量等级确定模块606,被配置为基于第一质量等级和第二质量等级,确定针对语音数据的目标质量等级,目标质量等级指示语音数据的语音质量。Fig. 6 shows a schematic block diagram of an apparatus for determining a speaker in a text according to an embodiment of the present disclosure. As shown in FIG. 6 , the
在一些实施例中,质量等级获取模块604包括:第一质量等级获取模块,被配置为通过将经训练的第一解码器应用于特征来获取第一质量等级;以及第二质量等级获取模块,被配置为通过将经训练的第二解码器应用于特征来获取第二质量等级。In some embodiments, the quality
在一些实施例中,特征确定模块包括:第一特征获取模块,被配置为将经调整的语音表示模型应用于语音数据来获得特征。In some embodiments, the feature determination module includes: a first feature acquisition module configured to apply the adjusted speech representation model to the speech data to obtain features.
在一些实施例中,还包括联合训练模块,被配置为通过以下模块获取经调整的语音表示模型、经训练的第一解码器和经训练的第二解码器:模型和解码器获取模块,被配置为获取微调的语音表示模型、第一解码器和第二解码器;样本质量等级获取模块,被配置为获取第一组样本语音数据及对应的样本质量等级;第一组标识确定模块,被配置为确定针对提供者的第一组标识,所述提供者用于提供所述样本质量等级;训练获得模块,被配置为通过利用所述第一组样本语音数据、对应的样本质量等级以及所述第一组标识联合训练所述微调的语音表示模型和所述第一解码器或所述第二解码器,来获得所述经调整的语音表示模型和所述经训练的第一解码器或所述经训练的第二解码器。In some embodiments, a joint training module is also included, configured to obtain the adjusted speech representation model, the trained first decoder, and the trained second decoder by: a model and decoder acquisition module, which is Configured to acquire the fine-tuned speech representation model, the first decoder and the second decoder; the sample quality level acquisition module is configured to acquire the first set of sample speech data and the corresponding sample quality level; the first set of identification determination module is configured configured to determine a first set of identifications for providers, the provider is used to provide the sample quality level; the training acquisition module is configured to use the first set of sample speech data, the corresponding sample quality level and the The first set of identifiers jointly trains the fine-tuned speech representation model and the first decoder or the second decoder to obtain the adjusted speech representation model and the trained first decoder or The trained second decoder.
在一些实施例中,特征确定模块包括:微调的语音表示模型获取模块,被配置为获取微调的语音表示模型;第二特征获取模块,被配置为通过将所述微调的语音表示模型应用于所述语音数据来获得所述特征。In some embodiments, the feature determination module includes: a fine-tuned speech representation model acquisition module configured to obtain a fine-tuned speech representation model; a second feature acquisition module configured to apply the fine-tuned speech representation model to the The speech data is used to obtain the features.
在一些实施例中,解码器获取模块,被配置为还包括通过以下模块获取所述经训练的第一解码器和所述经训练的第二解码器:样本特征确定模块,被配置为确定与第二组样本语音数据相对应的样本特征;第二组标识确定模块,被配置为确定针对提供者的第二组标识,所述提供者用于提供所述样本质量等级;解码器训练模块,被配置为利用所述样本特征、所述第二组标识和所述样本质量等级来训练所述第一解码器或所述第二解码器,来获得所述经训练的第一解码器或所述经训练的第二解码器。In some embodiments, the decoder obtaining module is configured to further include obtaining the trained first decoder and the trained second decoder by: a sample feature determination module configured to determine the same The sample features corresponding to the second group of sample speech data; the second group identification determination module is configured to determine the second group identification for the provider, and the provider is used to provide the sample quality level; the decoder training module, configured to train the first decoder or the second decoder using the sample features, the second set of identifications and the sample quality level to obtain the trained first decoder or the The trained second decoder described above.
In some embodiments, acquiring the fine-tuned speech representation model includes: a pre-trained model acquisition module configured to acquire a pre-trained speech representation model; a sample text acquisition module configured to acquire a third set of sample speech data and corresponding sample text; and a first generation module configured to generate the fine-tuned speech representation model by training the pre-trained speech representation model using the third set of sample speech data and the corresponding sample text.
In some embodiments, the pre-trained model acquisition module includes: a fourth sample speech data acquisition module configured to acquire a fourth set of sample speech data; and a pre-trained speech representation model generation module configured to generate the pre-trained speech representation model by training a speech representation model using the fourth set of sample speech data.
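The two training stages above — self-supervised pre-training on unlabeled sample speech data, followed by supervised fine-tuning with paired sample text — can be outlined as below. The masked-frame-prediction objective and the scalar linear "model" are illustrative assumptions; the patent does not fix either.

```python
# Illustrative two-stage pipeline: (1) pre-train by predicting each frame
# from its neighbors (needs no labels), (2) fine-tune with supervised
# (audio feature, text-derived target) pairs.
def pretrain(frames, lr=0.1, epochs=300):
    w = 0.0
    for _ in range(epochs):
        for i in range(1, len(frames) - 1):
            ctx = 0.5 * (frames[i - 1] + frames[i + 1])
            err = w * ctx - frames[i]          # predict the "masked" frame
            w -= lr * err * ctx
    return w

def fine_tune(w, pairs, lr=0.1, epochs=300):
    for _ in range(epochs):
        for x, target in pairs:                # target derived from sample text
            err = w * x - target
            w -= lr * err * x
    return w

w0 = pretrain([1.0, 1.0, 1.0, 1.0])            # smooth signal: w0 settles near 1
w1 = fine_tune(w0, [(1.0, 2.0), (2.0, 4.0)])   # supervision pulls w toward 2
```

The point of the sketch is the ordering: the pre-trained parameter `w0` is the starting point for fine-tuning, not discarded.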
In some embodiments, the speech representation model is a neural network model that converts a speech signal into corresponding features.
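A speech representation model, in the sense used above, maps a raw waveform to a sequence of features. As a toy stand-in for the neural network (whose architecture the patent leaves open), here is a frame-based extractor; the frame length and the log-energy feature are assumptions chosen only to make the mapping concrete.

```python
import math

def extract_features(signal, frame_len=4):
    """Toy stand-in for a speech representation model: split the waveform
    into fixed-length frames and emit one log-energy feature per frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        feats.append(math.log(energy + 1e-8))  # epsilon avoids log(0) on silence
    return feats

# Quiet first frame, louder second frame -> two features, second one larger.
feats = extract_features([0.0, 0.1, -0.1, 0.0, 0.5, -0.5, 0.5, -0.5])
```

A real system would replace this with a learned encoder, but the interface — waveform in, per-frame features out — is the same.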
In some embodiments, the features are related to text corresponding to the speech data.
In some embodiments, the apparatus 600 further includes: a third quality level acquisition module configured to acquire, based on the features, a third quality level for the speech data, the third quality level being related to a third language; and the target quality level determination module further includes: a three-level determination module configured to determine the target quality level based on the first quality level, the second quality level, and the third quality level.
In some embodiments, the target quality level determination module includes: a selection module configured to select, from the first quality level and the second quality level, the quality level corresponding to the language of the speech data as the target quality level.
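The language-matching selection above is simple to state in code. In this sketch the tuple layout, the language tags, and the fall-back to the higher score when neither language matches are all hypothetical — the patent only specifies selecting the level whose language matches the speech data.

```python
def select_target_quality(speech_language, first, second):
    """Pick, between two (language, quality_level) assessments, the one whose
    language matches the input speech; fall back to the higher level otherwise
    (the fallback is an assumption, not from the patent)."""
    by_language = {first[0]: first[1], second[0]: second[1]}
    if speech_language in by_language:
        return by_language[speech_language]
    return max(first[1], second[1])

# English speech: the English-language assessment becomes the target level.
level = select_target_quality("en", ("en", 4), ("zh", 2))
```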
In some embodiments, the apparatus 600 further includes: a speech data generation module configured to generate the speech data through speech synthesis or voice conversion.
FIG. 7 shows a schematic block diagram of an example device 700 that can be used to implement embodiments of the present disclosure. The computing device 104 in FIG. 1 can be implemented by the device 700. As shown, the device 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processes according to computer program instructions stored in a read-only memory (ROM) 702 or loaded from a storage unit 708 into a random access memory (RAM) 703. Various programs and data required for the operation of the device 700 can also be stored in the RAM 703. The CPU 701, the ROM 702, and the RAM 703 are connected to one another through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Multiple components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard or a mouse; an output unit 707 such as various types of displays and speakers; a storage unit 708 such as a magnetic disk or an optical disc; and a communication unit 709 such as a network card, a modem, or a wireless communication transceiver. The communication unit 709 allows the device 700 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The various procedures and processes described above, such as the methods 200 and 400, can be performed by the CPU 701. For example, in some embodiments, the methods 200 and 400 can be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program can be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the CPU 701, one or more actions of the methods 200 and 400 described above can be performed.
The present disclosure may be a method, an apparatus, a system, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for performing various aspects of the present disclosure.
A computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as a transient signal per se, such as a radio wave or another freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or another transmission medium (for example, a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device via a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on a user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or a server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized by utilizing state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or the other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium having the instructions stored thereon includes an article of manufacture that includes instructions implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device, so that a series of operational steps are performed on the computer, the other programmable data processing apparatus, or the other device to produce a computer-implemented process, such that the instructions executed on the computer, the other programmable data processing apparatus, or the other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which includes one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two consecutive blocks may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a special-purpose hardware-based system that performs the specified functions or acts, or by a combination of special-purpose hardware and computer instructions.
Various embodiments of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical applications, or technical improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (16)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210939917.2A CN115312078B (en) | 2022-08-05 | 2022-08-05 | Method, apparatus, device and storage medium for determining the quality of voice data |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210939917.2A CN115312078B (en) | 2022-08-05 | 2022-08-05 | Method, apparatus, device and storage medium for determining the quality of voice data |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115312078A true CN115312078A (en) | 2022-11-08 |
| CN115312078B CN115312078B (en) | 2025-02-18 |
Family
ID=83861583
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210939917.2A Active CN115312078B (en) | 2022-08-05 | 2022-08-05 | Method, apparatus, device and storage medium for determining the quality of voice data |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115312078B (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116230018A (en) * | 2023-02-22 | 2023-06-06 | 上海交通大学 | A Synthesized Speech Quality Evaluation Method for Speech Synthesis System |
Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20040039570A1 (en) * | 2000-11-28 | 2004-02-26 | Steffen Harengel | Method and system for multilingual voice recognition |
| CN105931636A (en) * | 2015-11-30 | 2016-09-07 | 中华电信股份有限公司 | Multi-language system voice recognition device and method thereof |
| CN107221318A (en) * | 2017-05-12 | 2017-09-29 | 广东外语外贸大学 | Oral English Practice pronunciation methods of marking and system |
| CN109410915A (en) * | 2017-08-15 | 2019-03-01 | 中国移动通信集团终端有限公司 | The appraisal procedure and device of voice quality, computer readable storage medium |
| KR20190050659A (en) * | 2017-11-03 | 2019-05-13 | 주식회사 비즈모델라인 | Method for Evaluating Response of Heterogeneous Language Speaking |
| CN110415725A (en) * | 2019-07-15 | 2019-11-05 | 北京语言大学 | Use the method and system of first language data assessment second language pronunciation quality |
| CN111179916A (en) * | 2019-12-31 | 2020-05-19 | 广州市百果园信息技术有限公司 | Re-scoring model training method, voice recognition method and related device |
| CN112562736A (en) * | 2020-12-11 | 2021-03-26 | 中国信息通信研究院 | Voice data set quality evaluation method and device |
| US20210217403A1 (en) * | 2019-05-15 | 2021-07-15 | Lg Electronics Inc. | Speech synthesizer for evaluating quality of synthesized speech using artificial intelligence and method of operating the same |
| CN114242044A (en) * | 2022-02-25 | 2022-03-25 | 腾讯科技(深圳)有限公司 | Voice quality evaluation method, voice quality evaluation model training method and device |
Patent Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20040039570A1 (en) * | 2000-11-28 | 2004-02-26 | Steffen Harengel | Method and system for multilingual voice recognition |
| CN105931636A (en) * | 2015-11-30 | 2016-09-07 | 中华电信股份有限公司 | Multi-language system voice recognition device and method thereof |
| CN107221318A (en) * | 2017-05-12 | 2017-09-29 | 广东外语外贸大学 | Oral English Practice pronunciation methods of marking and system |
| CN109410915A (en) * | 2017-08-15 | 2019-03-01 | 中国移动通信集团终端有限公司 | The appraisal procedure and device of voice quality, computer readable storage medium |
| KR20190050659A (en) * | 2017-11-03 | 2019-05-13 | 주식회사 비즈모델라인 | Method for Evaluating Response of Heterogeneous Language Speaking |
| US20210217403A1 (en) * | 2019-05-15 | 2021-07-15 | Lg Electronics Inc. | Speech synthesizer for evaluating quality of synthesized speech using artificial intelligence and method of operating the same |
| CN110415725A (en) * | 2019-07-15 | 2019-11-05 | 北京语言大学 | Use the method and system of first language data assessment second language pronunciation quality |
| CN111179916A (en) * | 2019-12-31 | 2020-05-19 | 广州市百果园信息技术有限公司 | Re-scoring model training method, voice recognition method and related device |
| CN112562736A (en) * | 2020-12-11 | 2021-03-26 | 中国信息通信研究院 | Voice data set quality evaluation method and device |
| CN114242044A (en) * | 2022-02-25 | 2022-03-25 | 腾讯科技(深圳)有限公司 | Voice quality evaluation method, voice quality evaluation model training method and device |
Non-Patent Citations (4)
| Title |
|---|
| MICHAEL CHINEN: "Marginal Effects of Language and Individual Raters on Speech Quality Models", IEEE ACCESS, 13 September 2021 (2021-09-13) * |
| 张爽;刘加;: "Research on a pronunciation quality evaluation method using prosody improvement in language learning machines" (语言学习机中使用韵律改进的发音质量评价方法研究), Journal of Chinese Computer Systems (小型微型计算机系统), no. 05, 15 May 2009 (2009-05-15) * |
| 李琳琳;李娜;张志楠;: "Automatic detection of English speech stress" (基于英语语音重音的自动探测), China Science and Technology Information (中国科技信息), no. 11, 1 June 2013 (2013-06-01) * |
| 邱泽宇;屈丹;张连海;: "An end-to-end speech synthesis method based on WaveNet" (基于WaveNet的端到端语音合成方法), Journal of Computer Applications (计算机应用), no. 05, 21 January 2019 (2019-01-21) * |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116230018A (en) * | 2023-02-22 | 2023-06-06 | 上海交通大学 | A Synthesized Speech Quality Evaluation Method for Speech Synthesis System |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115312078B (en) | 2025-02-18 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN110457457B (en) | Training method of dialogue generation model, dialogue generation method and device | |
| US11093813B2 (en) | Answer to question neural networks | |
| CN111402861B (en) | Voice recognition method, device, equipment and storage medium | |
| CN112735373A (en) | Speech synthesis method, apparatus, device and storage medium | |
| CN108766414A (en) | Method, apparatus, equipment and computer readable storage medium for voiced translation | |
| WO2021227707A1 (en) | Audio synthesis method and apparatus, computer readable medium, and electronic device | |
| CN112185363B (en) | Audio processing method and device | |
| US9613616B2 (en) | Synthesizing an aggregate voice | |
| CN114783410B (en) | Speech synthesis method, system, electronic device and storage medium | |
| US20190206386A1 (en) | Method and system for text-to-speech synthesis | |
| CN109119067B (en) | Speech synthesis method and device | |
| CN110600013A (en) | Training method and device for non-parallel corpus voice conversion data enhancement model | |
| Feng et al. | ASR-GLUE: A new multi-task benchmark for asr-robust natural language understanding | |
| CN116561265A (en) | Personalized dialog generation method, model training method and device | |
| KR102663654B1 (en) | Adaptive visual speech recognition | |
| Cui et al. | Evaluation system of mobile english learning platform by using deep learning algorithm | |
| CN115171644A (en) | Speech synthesis method, apparatus, electronic device and storage medium | |
| CN115206342A (en) | A data processing method, apparatus, computer equipment and readable storage medium | |
| CN115312078A (en) | Method, apparatus, device and storage medium for determining quality of voice data | |
| CN116166793A (en) | Training method of abstract generation model, abstract generation method and device | |
| CN115346520A (en) | Method, apparatus, electronic device and medium for speech recognition | |
| Behre et al. | Streaming punctuation: A novel punctuation technique leveraging bidirectional context for continuous speech recognition | |
| CN111191451A (en) | Chinese sentence simplification method and device | |
| CN114783405B (en) | Speech synthesis method, device, electronic equipment and storage medium | |
| Liu | Multimedia interactive system of vocal music teaching based on voice recognition |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |
