
CN111105786B - Multi-sampling-rate voice recognition method, device, system and storage medium - Google Patents


Info

Publication number
CN111105786B
CN111105786B (application CN201911363288.8A)
Authority
CN
China
Prior art keywords
audio
sampling rate
speech recognition
training
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911363288.8A
Other languages
Chinese (zh)
Other versions
CN111105786A
Inventor
施雨豪
钱彦旻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN201911363288.8A
Publication of CN111105786A
Application granted
Publication of CN111105786B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 — Training
    • G10L 15/08 — Speech classification or search
    • G10L 15/16 — Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a multi-sampling-rate speech recognition method, device, system and storage medium. First, without changing the audio sampling rate, features are extracted from audio of different sampling rates using a configuration that corresponds to each sampling rate, and a neural network model is trained on the extracted features. The neural network model carries both a general speech recognition label and a sampling-rate classification label; during training, adversarial training is applied to the sampling-rate classification label via gradient reversal, so that the resulting multi-sampling-rate speech recognition model adapts to audio of different sampling rates. The trained model can then be used for speech recognition, achieving the goal of handling audio input at multiple sampling rates uniformly with a single speech recognition model.

Description

A multi-sampling-rate speech recognition method, device, system and storage medium

Technical Field

The present invention relates to the field of artificial-intelligence voice interaction, and in particular to a multi-sampling-rate speech recognition method, device, system and storage medium.

Background

With the continuous development of artificial intelligence and electronic communication technology, intelligent voice interaction is increasingly applied across many product areas, including intelligent customer service, call centers, smart speakers and smart watches.

However, although all of these applications perform speech recognition, the speech sampling rate differs across application scenarios. To process speech samples of multiple sampling rates in one system, two schemes are commonly adopted: 1) unify the sampling rate of all audio by up/down-sampling so that a single speech recognition system can be used; this changes the properties of the original audio and degrades recognition accuracy; 2) deploy multiple speech recognition systems and select the most suitable output according to confidence or perplexity; this suffers from low resource utilization and high operation and maintenance costs.

Summary of the Invention

In view of the above problems, the inventors creatively provide a multi-sampling-rate speech recognition method, device, system and storage medium.

According to a first aspect of the embodiments of the present invention, a method for training a multi-sampling-rate speech recognition model includes: acquiring audio features of at least two different sampling rates; and training a neural network model with the audio features as input, wherein the audio features are annotated with a speech recognition label and a sampling-rate classification label.

According to an embodiment of the present invention, acquiring audio features of at least two different sampling rates includes: receiving audio inputs of at least two different sampling rates; setting configuration information for feature extraction according to the sampling-rate class of each audio input; and extracting features from the audio using the configuration information to obtain audio features of at least two different sampling rates.

According to an embodiment of the present invention, training the neural network model includes: training the neural network model normally with respect to the speech recognition label, and adversarially with respect to the sampling-rate classification label.

According to an embodiment of the present invention, adversarially training the neural network model with respect to the sampling-rate classification label includes: performing the adversarial training according to a cross-entropy training criterion.

According to an embodiment of the present invention, performing the adversarial training includes: reversing the gradient and then back-propagating it.

According to a second aspect of the embodiments of the present invention, a multi-sampling-rate speech recognition method includes: receiving audio features; and feeding the audio features into a multi-sampling-rate speech recognition model to obtain a speech recognition result, wherein the multi-sampling-rate speech recognition model is trained by any of the training methods described above.

According to a third aspect of the embodiments of the present invention, a training device for a multi-sampling-rate speech recognition model includes: an audio feature acquisition module for acquiring audio features of at least two different sampling rates; and a neural network model training module for training a neural network model with the audio features as input, wherein the audio features are annotated with a speech recognition label and a sampling-rate classification label.

According to an embodiment of the present invention, the audio feature acquisition module includes: an audio input receiving unit for receiving audio inputs of at least two different sampling rates; a feature extraction configuration unit for setting configuration information for feature extraction according to the sampling-rate class of each audio input; and an audio feature extraction unit for extracting features from the audio using the configuration information to obtain audio features of at least two different sampling rates.

According to an embodiment of the present invention, the neural network model training module includes: a speech recognition training unit for normally training the neural network model with respect to the speech recognition label; and a sampling-rate classification training unit for adversarially training the neural network model with respect to the sampling-rate classification label.

According to an embodiment of the present invention, the sampling-rate classification training unit is specifically configured to adversarially train the neural network model with respect to the sampling-rate classification label according to a cross-entropy training criterion.

According to an embodiment of the present invention, the sampling-rate classification training unit includes a gradient reversal subunit for performing the adversarial training by reversing the gradient and then back-propagating it.

According to a fourth aspect of the embodiments of the present invention, a multi-sampling-rate speech recognition device includes: an audio feature receiving module for receiving audio features; and a speech recognition module for feeding the audio features into a multi-sampling-rate speech recognition model to obtain a speech recognition result, wherein the multi-sampling-rate speech recognition model is trained by any of the training methods described above.

According to a fifth aspect of the embodiments of the present invention, a multi-sampling-rate speech recognition system includes a processor and a memory, wherein the memory stores computer program instructions which, when run by the processor, perform any of the methods described above.

According to a sixth aspect of the embodiments of the present invention, a computer storage medium includes a set of computer-executable instructions which, when executed, perform any of the methods described above.

Embodiments of the present invention provide a multi-sampling-rate speech recognition method, device, system and storage medium. First, without changing the audio sampling rate, features are extracted from audio of different sampling rates using a configuration that corresponds to each sampling rate, and a neural network model is trained on the extracted features. Besides the usual speech recognition label, the neural network model carries an additional sampling-rate classification label, and during training the sampling-rate classification label is trained adversarially via gradient reversal, so that the resulting multi-sampling-rate speech recognition model adapts autonomously to audio of different sampling rates. The model can then be used for speech recognition, handling audio input at multiple sampling rates uniformly with a single speech recognition model. This preserves the properties of the original audio while greatly reducing the training and maintenance costs of the speech recognition system. In addition, audio data of different sampling rates can be pooled, further increasing data diversity and multiplying the amount of usable data.

Brief Description of the Drawings

The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily understood by reading the following detailed description with reference to the accompanying drawings, in which several embodiments of the present invention are shown by way of example and not limitation.

In the drawings, the same or corresponding reference numerals denote the same or corresponding parts.

FIG. 1 is a schematic flowchart of a method for training a multi-sampling-rate speech recognition model according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a specific implementation of the training method in one application of the present invention;

FIG. 3 is a schematic flowchart of a multi-sampling-rate speech recognition method according to an embodiment of the present invention;

FIG. 4 is a schematic flowchart of a specific implementation of the multi-sampling-rate speech recognition method in one application of the present invention;

FIG. 5 is a schematic structural diagram of a training device for a multi-sampling-rate speech recognition model according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a multi-sampling-rate speech recognition device according to an embodiment of the present invention.

Detailed Description

To make the objectives, features and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

In this specification, description with reference to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Moreover, the described specific features, structures, materials or characteristics may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, where no contradiction arises, those skilled in the art may combine the different embodiments or examples described in this specification and the features thereof.

In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or the number of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means two or more, unless explicitly and specifically defined otherwise.

FIG. 1 shows the implementation flow of a method for training a multi-sampling-rate speech recognition model according to an embodiment of the present invention. Referring to FIG. 1, the method includes: operation 110, acquiring audio features of at least two different sampling rates; and operation 120, training a neural network model with the audio features as input, wherein the audio features are annotated with a speech recognition label and a sampling-rate classification label.

In operation 110, the audio feature data are the training data for the neural network model. They may be obtained by extracting features from audio input, or from an audio feature supplier or corpus. Note that training data differ from data in real applications in that they have been labeled. The label is the content to be predicted, and its value is the expected value. By comparing the actual prediction with the expected value, the model can be continuously revised so that the error between them tends to a minimum, yielding a model accurate enough for practical use. For the trained speech recognition system to recognize audio data of multiple sampling rates, audio inputs of at least two different sampling rates must be used when training the neural network model. Moreover, the training data should preferably have the same sampling rates as the audio to be recognized in the target application; recognition accuracy on those sampling rates will then be higher and the deployed system will perform better. With the training method of the embodiments of the present invention, no particular distribution of data volume across sampling rates is required; data of different sampling rates may be mixed in any proportion.

In operation 120, the neural network is trained with the audio features obtained in operation 110. As mentioned above, the training data carry labels whose values are the expected values. By comparing the actual predictions with the expected values, the model is continuously revised until the error is minimized and a sufficiently accurate, practically usable model is obtained; this is the training process. Unlike other neural network models for speech recognition, the embodiment of the present invention adds a sampling-rate classification label, which means the model predicts not only the speech recognition result but also the sampling-rate class of the audio input. This label serves to better fuse audio data of different sampling rates, enlarging the amount of data available to the model; data from different sources also increase diversity, making predictions comparatively more accurate.
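One plausible shape for such a two-label model, consistent with the description above but with every dimension invented for illustration, is a shared encoder feeding a speech recognition head and an utterance-level sampling-rate classification head. A minimal numpy forward-pass sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (all assumed): 40-dim F-bank frames, 64 hidden units,
# 30 recognition tokens, 3 sampling-rate classes.
FEAT_DIM, HIDDEN, N_TOKENS, N_RATES = 40, 64, 30, 3

# Shared encoder weights plus one weight matrix per output head.
W_enc = rng.normal(scale=0.1, size=(FEAT_DIM, HIDDEN))
W_asr = rng.normal(scale=0.1, size=(HIDDEN, N_TOKENS))
W_cls = rng.normal(scale=0.1, size=(HIDDEN, N_RATES))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(frames):
    """Return per-frame token posteriors and one sampling-rate posterior."""
    h = np.tanh(frames @ W_enc)                   # shared representation
    asr_post = softmax(h @ W_asr)                 # speech recognition head
    rate_post = softmax(h.mean(axis=0) @ W_cls)   # sampling-rate head
    return asr_post, rate_post

frames = rng.normal(size=(50, FEAT_DIM))          # 50 frames of a toy utterance
asr_post, rate_post = forward(frames)
print(asr_post.shape, rate_post.shape)            # (50, 30) (3,)
```

At training time the recognition head receives the normal loss while the sampling-rate head is trained adversarially, as the following paragraphs describe.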

According to an embodiment of the present invention, acquiring audio features of at least two different sampling rates includes: receiving audio inputs of at least two different sampling rates; setting configuration information for feature extraction according to the sampling-rate class of each audio input; and extracting features from the audio using the configuration information to obtain audio features of at least two different sampling rates.

The audio input here may be raw audio captured by an audio acquisition device, but is preferably audio that has undergone speech signal processing: such audio is cleaner, easier to recognize, and yields better training results. The audio inputs may also be obtained from a data provider or an audio database. The sampling-rate classes are predefined, and the class of each audio input is known; it only needs to be specified through input parameters or configuration information. The sampling-rate classes are defined at modeling time, and the value specified here must match the value defined when the sampling-rate classification label was established. For example, if the class of the first-sampling-rate audio is defined as 1 at modeling time, then the class specified when training with first-sampling-rate audio is 1. The feature-extraction process and parameters differ slightly across sampling rates. Therefore, some audio-processing parameters must be set per sampling rate so that the features most characteristic of audio at that rate are obtained; the values of these parameters constitute the configuration information for feature extraction. For example, audio feature extraction requires a maximum frequency: for audio sampled at 8 kHz this parameter may be set to 4000, while for audio sampled at 16 kHz it may be set to 8000. The configuration for a given sampling rate can be preset according to the properties of audio at that rate, and in this operation the preset values are simply retrieved. Any applicable feature-extraction method may be used; the embodiments of the present invention mainly adopt the widely used filter-bank (F-bank) features. The extracted audio features include speech recognition features, such as phonemes and characters, and also carry the sampling-rate class value.

According to an embodiment of the present invention, training the neural network model includes: training the neural network model normally with respect to the speech recognition label, and adversarially with respect to the sampling-rate classification label.

Here, the normal training with respect to the speech recognition label is no different from that of other speech recognition systems; it generally uses the neural-network-based connectionist temporal classification (CTC) criterion, the maximum mutual information (MMI) criterion, or any other applicable criterion. The adversarial training with respect to the sampling-rate classification label, however, is the distinguishing feature of the embodiments of the present invention. This adversarial training, which may also be called interference training, aims to prevent the model from accurately distinguishing sampling-rate classes, so that audio data of different sampling rates are fully exploited and predictions are made on a more diverse data basis. The resulting predictions are adaptive and more accurate, allowing a single model to process multi-sampling-rate audio input uniformly and output speech recognition results.

According to an embodiment of the present invention, adversarially training the neural network model with respect to the sampling-rate classification label includes: performing the adversarial training according to a cross-entropy training criterion.

Cross-entropy is widely used for disambiguation: by computing the cross-entropy between prior and posterior information, ambiguity can be identified and eliminated under its guidance, and the criterion is particularly suitable for adaptive computer implementations. When the neural network model is trained with respect to the sampling-rate classification label using the cross-entropy criterion, the resulting multi-sampling-rate speech recognition model can autonomously adapt to data of multiple sampling rates.
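A minimal standard-library sketch of the cross-entropy criterion as it would apply to the sampling-rate classifier head (logit values are illustrative). The softmax-minus-one-hot gradient it returns is the quantity whose sign gets flipped during the gradient reversal described in the patent:

```python
import math

def rate_ce_loss_and_grad(logits, target):
    """Cross-entropy loss -log p(target) for one utterance's rate logits,
    plus the gradient w.r.t. the logits (softmax - one-hot)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]        # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[target])
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return loss, grad

logits = [2.0, 0.5, -1.0]   # classifier scores for 3 sampling-rate classes
loss, grad = rate_ce_loss_and_grad(logits, target=0)
print(round(loss, 3), [round(g, 3) for g in grad])
```

Note the gradient components always sum to zero, a standard property of softmax cross-entropy.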

According to an embodiment of the present invention, performing the adversarial training includes: reversing the gradient and then back-propagating it.

The gradient here can be understood simply as the error between the actual prediction and the expected value, which serves as a basis for adjusting the parameters of the neural network model: the smaller the gradient, the more accurate the prediction and the more mature the model. When the model predicts the sampling-rate class, the gradient is reversed so that the model comes to ignore the differences between sampling rates: the gradient is multiplied by minus one, and this reversed gradient is back-propagated to train the model. This is the interference mechanism of the adversarial training. In this way the model cannot accurately predict the sampling-rate class and therefore does not treat inputs differently according to their sampling rate, which greatly increases the contribution of data at every sampling rate and allows audio features of different sampling rates to be processed uniformly. This is the key reason why adversarial training lets the speech recognition system ignore the divergence caused by sampling rates and achieve good recognition performance.
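The reversal itself can be sketched in a few lines: the classifier branch's gradient is multiplied by minus one (optionally scaled by a strength factor, an assumed hyperparameter not stated in the patent) before being combined with the recognition branch's gradient at the shared layers. The toy gradient values below are purely illustrative:

```python
GRL_LAMBDA = 1.0  # reversal strength (assumed hyperparameter)

def grl_backward(grad, lam=GRL_LAMBDA):
    """Gradient reversal: identity in the forward pass, -lam * grad backward."""
    return [-lam * g for g in grad]

# Toy gradients flowing into a shared parameter from each head.
g_asr = [0.30, -0.10, 0.05]   # from the speech recognition loss
g_cls = [0.20, 0.40, -0.30]   # from the sampling-rate classifier loss

# The shared layers receive the ASR gradient plus the REVERSED classifier
# gradient, so they are pushed away from encoding sampling-rate cues.
g_shared = [a + r for a, r in zip(g_asr, grl_backward(g_cls))]
print(g_shared)
```

Had the classifier gradient been added unreversed, the shared layers would instead learn to separate sampling rates, which is exactly what the adversarial scheme prevents.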

A specific flow of a method for training a multi-sampling-rate speech recognition model in one application of the present invention is described below with reference to FIG. 2. In the application scenario shown in FIG. 2, the system receives audio data of three sampling rates: first-sampling-rate audio data, second-sampling-rate audio data and third-sampling-rate audio data. Training the multi-sampling-rate speech recognition model for this system mainly involves the following steps:

Step 201, receive audio input.

The audio input here is training data with training labels, covering at least two different sampling rates.

Step 220, determining the sampling-rate class;

The sampling-rate class to which the audio input belongs is known information and can be obtained from a parameter or from configuration information. In this step the class of the audio input is determined, and the next step is chosen accordingly: if the sampling rate of the audio input is the first sampling rate, proceed to step 230; if it is the second sampling rate, proceed to step 240; if it is the third sampling rate, proceed to step 250.

Step 230, first-sampling-rate audio feature extraction;

The received audio input is first-sampling-rate audio, so feature extraction is performed with the configuration used for first-sampling-rate audio. This configuration can be preset according to the audio properties of first-sampling-rate audio; in this step, only the preset values need to be retrieved for configuration. Once configured, audio feature extraction is performed, and the resulting audio features are fed to the neural network model for training.

Step 240, second-sampling-rate audio feature extraction;

The received audio input is second-sampling-rate audio, so feature extraction is performed with the configuration used for second-sampling-rate audio. This configuration can be preset according to the audio properties of second-sampling-rate audio; in this step, only the preset values need to be retrieved for configuration. Once configured, audio feature extraction is performed, and the resulting audio features are fed to the neural network model for training.

Step 250, third-sampling-rate audio feature extraction;

The received audio input is third-sampling-rate audio, so feature extraction is performed with the configuration used for third-sampling-rate audio. This configuration can be preset according to the audio properties of third-sampling-rate audio; in this step, only the preset values need to be retrieved for configuration. Once configured, audio feature extraction is performed, and the resulting audio features are fed to the neural network model for training.
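The per-class feature extraction above amounts to a simple lookup of preset configuration values. The sketch below illustrates this; the concrete sampling rates and parameter values are assumptions for illustration only, not values from the patent:

```python
# Illustrative preset feature-extraction configurations, one per
# sampling-rate class (rates and values are assumed, not specified
# in the patent).
FEATURE_CONFIGS = {
    8000:  {"n_mels": 40, "n_fft": 256,  "hop_length": 80},
    16000: {"n_mels": 40, "n_fft": 512,  "hop_length": 160},
    44100: {"n_mels": 40, "n_fft": 1024, "hop_length": 441},
}

def select_config(sample_rate: int) -> dict:
    """Steps 220-250: the sampling-rate class is known in advance,
    so this step only retrieves the preset configuration for it."""
    if sample_rate not in FEATURE_CONFIGS:
        raise ValueError(f"unsupported sampling rate: {sample_rate}")
    return FEATURE_CONFIGS[sample_rate]

assert select_config(16000)["n_fft"] == 512
```

The same lookup serves both training (FIG. 2) and recognition (FIG. 4), since in both flows the class of the input is known before feature extraction begins.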

Step 260, training the neural network model;

This step uses the audio features obtained in the preceding steps to train the neural network, adjusting the parameters of the neural network model according to the prediction results so that the prediction error keeps decreasing. In this application of the embodiment of the present invention, the gradient descent method is mainly used.

Step 270, inverting the gradient for the sampling-rate classification label and propagating it back;

As described above, in order to fuse audio data of different sampling rates, the gradient is inverted during training; this is the adversarial learning method.

Step 280, performing gradient back-propagation for the speech recognition label.

The gradient back-propagation here is normal back-propagation; no inversion is performed.

It should be noted that training the neural network model is an open-ended iterative process. The criterion for regarding the model as mature can be decided according to the needs of the actual application, and a multi-sampling-rate speech recognition model deemed sufficiently mature is then deployed in the actual production environment to perform speech recognition.
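Steps 260 to 280 can be sketched as a single training step on a toy two-head model: cross-entropy is computed for both labels, the speech recognition gradient reaches the shared encoder unchanged, and the sampling-rate gradient reaches it negated. This is a minimal numpy illustration under assumed dimensions and linear layers, not the patent's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, hid_dim, n_phones, n_rates = 20, 16, 10, 3

# Shared linear encoder with two linear heads (sizes are illustrative):
# an ASR head and a sampling-rate classifier head.
W_enc = 0.1 * rng.normal(size=(feat_dim, hid_dim))
W_asr = 0.1 * rng.normal(size=(hid_dim, n_phones))
W_sr  = 0.1 * rng.normal(size=(hid_dim, n_rates))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def train_step(x, y_asr, y_sr, lr=0.05):
    """One gradient-descent step of steps 260-280."""
    global W_enc, W_asr, W_sr
    n = x.shape[0]
    h = x @ W_enc                                     # shared representation
    g_asr = softmax(h @ W_asr); g_asr[np.arange(n), y_asr] -= 1  # dCE/dlogits
    g_sr  = softmax(h @ W_sr);  g_sr[np.arange(n), y_sr]  -= 1

    # Step 280: normal back-propagation for the speech recognition label.
    # Step 270: gradient inversion (the minus sign) for the rate label.
    dh = g_asr @ W_asr.T - g_sr @ W_sr.T

    W_asr -= lr * (h.T @ g_asr) / n
    W_sr  -= lr * (h.T @ g_sr) / n    # the classifier head itself trains normally
    W_enc -= lr * (x.T @ dh) / n      # encoder sees the inverted rate gradient

x = rng.normal(size=(8, feat_dim))
y_asr = rng.integers(0, n_phones, size=8)
y_sr = rng.integers(0, n_rates, size=8)
for _ in range(10):
    train_step(x, y_asr, y_sr)
assert np.isfinite(W_enc).all()
```

Note the asymmetry: only the gradient flowing from the rate head into the shared encoder is negated, so the rate classifier keeps trying to predict the class while the encoder learns features from which the class cannot be predicted.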

After a sufficiently mature multi-sampling-rate speech recognition model has been obtained by executing the above training method, an embodiment of the present invention further provides a speech recognition method. As shown in FIG. 3, the method includes: operation 310, receiving audio features; operation 320, inputting the audio features into the multi-sampling-rate speech recognition model to obtain a speech recognition result, where the multi-sampling-rate speech recognition model is trained by any of the training methods described above.

In operation 310, the audio features are extracted from new, unlabeled data in the actual production environment. This new data is audio data belonging to one of the sampling-rate classes present in the training data, for example first-sampling-rate audio data.

In operation 320, the already trained multi-sampling-rate speech recognition model is used, and a prediction result, i.e. the speech recognition result, is obtained directly from the input.

It should be noted that after prediction on newly input, unlabeled audio features from the production environment, no gradient back-propagation is performed; however, labels can be assigned according to the prediction results, turning the data into new training data for subsequent semi-supervised learning.
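The semi-supervised reuse described above is commonly gated by prediction confidence. A minimal sketch follows; the confidence threshold and the toy predictor are illustrative assumptions, as the patent does not specify a selection rule:

```python
import numpy as np

def pseudo_label(predict_fn, feats, threshold=0.9):
    """Keep only confident predictions as new (feature, label) training
    pairs; low-confidence utterances are discarded. The threshold is
    an illustrative assumption."""
    new_data = []
    for f in feats:
        probs = predict_fn(f)          # model posterior over labels
        best = int(np.argmax(probs))
        if probs[best] >= threshold:   # confidence gate
            new_data.append((f, best))
    return new_data

# Toy predictor: confident only on positive-mean feature vectors.
toy = lambda f: np.array([0.95, 0.05]) if f.mean() > 0 else np.array([0.6, 0.4])
pairs = pseudo_label(toy, [np.ones(3), -np.ones(3)])
assert len(pairs) == 1 and pairs[0][1] == 0
```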

The following describes in detail, with reference to FIG. 4, how an application example of the present invention performs multi-sampling-rate speech recognition using the multi-sampling-rate speech recognition model trained through the steps shown in FIG. 2. As shown in FIG. 4, multi-sampling-rate speech recognition can be performed with the following steps:

Step 410, receiving audio input;

The audio input here is newly input audio data from the production environment, without training labels.

Step 420, determining the sampling-rate class;

The sampling-rate class to which the audio input belongs is known information and can be obtained from a parameter or from configuration information. In this step the class of the audio input is determined, and the next step is chosen accordingly: if the sampling rate of the audio input is the first sampling rate, proceed to step 430; if it is the second sampling rate, proceed to step 440; if it is the third sampling rate, proceed to step 450.

Step 430, first-sampling-rate audio feature extraction;

The received audio input is first-sampling-rate audio, so feature extraction is performed with the configuration used for first-sampling-rate audio. This configuration can be preset according to the audio properties of first-sampling-rate audio; in this step, only the preset values need to be retrieved for configuration. Once configured, audio feature extraction is performed, and the resulting audio features are fed to the multi-sampling-rate speech recognition model for speech recognition.

Step 440, second-sampling-rate audio feature extraction;

The received audio input is second-sampling-rate audio, so feature extraction is performed with the configuration used for second-sampling-rate audio. This configuration can be preset according to the audio properties of second-sampling-rate audio; in this step, only the preset values need to be retrieved for configuration. Once configured, audio feature extraction is performed, and the resulting audio features are fed to the multi-sampling-rate speech recognition model for speech recognition.

Step 450, third-sampling-rate audio feature extraction;

The received audio input is third-sampling-rate audio, so feature extraction is performed with the configuration used for third-sampling-rate audio. This configuration can be preset according to the audio properties of third-sampling-rate audio; in this step, only the preset values need to be retrieved for configuration. Once configured, audio feature extraction is performed, and the resulting audio features are fed to the multi-sampling-rate speech recognition model for speech recognition.

Step 460, performing speech recognition with the multi-sampling-rate speech recognition model;

The prediction model used here is the already trained multi-sampling-rate speech recognition model, which can be used for speech recognition in the production environment.

Step 470, outputting the speech recognition result.

Here, from the audio features extracted in step 430, step 440 or step 450, the multi-sampling-rate speech recognition model produces a prediction result, and this prediction result is the speech recognition result.
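Taken together, the recognition flow of FIG. 4 can be sketched as a small dispatch function. The configurations, feature extractor and model below are stubs standing in for the real components and are assumptions for illustration only:

```python
# Illustrative per-rate configurations (assumed values, not from the patent).
CONFIGS = {8000: {"n_fft": 256}, 16000: {"n_fft": 512}, 44100: {"n_fft": 1024}}

def recognize(audio, sample_rate, extract_features, model):
    """Steps 410-470: look up the preset config for the known sampling
    rate, extract features with it, run the trained model, and return
    the recognition result."""
    config = CONFIGS[sample_rate]            # steps 420-450: pick config
    feats = extract_features(audio, config)  # per-rate feature extraction
    return model(feats)                      # steps 460-470: recognize

# Stub extractor and model, just to exercise the control flow.
extract = lambda audio, cfg: [len(audio), cfg["n_fft"]]
model = lambda feats: f"decoded({feats[0]} samples @ n_fft={feats[1]})"
assert recognize([0.0] * 160, 16000, extract, model) == "decoded(160 samples @ n_fft=512)"
```

Because adversarial training made the model indifferent to sampling rate, the same `model` call serves every branch; only the feature-extraction configuration differs per class.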

Further, an embodiment of the present invention also provides a training apparatus for a multi-sampling-rate speech recognition model. As shown in FIG. 5, the apparatus 50 includes: an audio feature acquisition module 501, configured to acquire audio features of at least two different sampling rates; and a neural network model training module 502, configured to train the neural network model with the audio features as input, where the audio features are annotated with speech recognition labels and sampling-rate classification labels.

According to an embodiment of the present invention, the audio feature acquisition module 501 includes: an audio input receiving unit, configured to receive audio inputs of at least two different sampling rates; a feature extraction configuration unit, configured to set the configuration information for feature extraction according to the sampling-rate class to which the audio input belongs; and an audio feature extraction unit, configured to perform feature extraction on the audio using the configuration information to obtain audio features of at least two different sampling rates.

According to an embodiment of the present invention, the neural network model training module 502 includes: a speech recognition training unit, configured to train the neural network model normally with respect to the speech recognition labels; and a sampling-rate classification training unit, configured to perform adversarial training on the neural network model with respect to the sampling-rate classification labels.

According to an embodiment of the present invention, the sampling-rate classification training unit is specifically configured to perform adversarial training on the neural network model with respect to the sampling-rate classification labels according to a cross-entropy training criterion.

According to an embodiment of the present invention, the sampling-rate classification training unit includes: a gradient inversion subunit, configured to perform adversarial training by inverting the gradient before back-propagating it.

In addition, an embodiment of the present invention also provides a multi-sampling-rate speech recognition apparatus. As shown in FIG. 6, the apparatus 60 includes: an audio feature receiving module 601, configured to receive audio features; and a speech recognition module 602, configured to input the audio features into a multi-sampling-rate speech recognition model to obtain a speech recognition result, where the multi-sampling-rate speech recognition model is trained by any of the training methods described above.

According to a fifth aspect of the embodiments of the present invention, a multi-sampling-rate speech recognition system is provided. The system includes a processor and a memory, where the memory stores computer program instructions that, when run by the processor, cause the processor to perform any of the methods described above.

According to a sixth aspect of the embodiments of the present invention, a computer storage medium is provided. The storage medium includes a set of computer-executable instructions which, when executed, perform any of the methods described above.

It should be pointed out that the above descriptions of the embodiments of the training apparatus for a multi-sampling-rate speech recognition model, of the multi-sampling-rate speech recognition apparatus, of the multi-sampling-rate speech recognition system and of the computer storage medium are similar to the descriptions of the foregoing method embodiments and have similar beneficial effects, and are therefore not repeated. For technical details not yet disclosed in these apparatus, system and storage medium embodiments, please refer to the descriptions of the foregoing method embodiments of the present invention; to save space, they are not repeated here.

It should be noted that, herein, the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or apparatus including a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or apparatus that includes that element.

In the several embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined, or may be integrated into another device, or some features may be omitted or not executed. In addition, the couplings, direct couplings or communication connections between the components shown or discussed may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical, mechanical or of other forms.

The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may all be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable storage medium, a read-only memory (ROM), a magnetic disk or an optical disk.

Alternatively, if the above integrated unit of the present invention is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a removable storage medium, a ROM, a magnetic disk or an optical disk.

The above are only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be that of the claims.

Claims (9)

1. A method for training a multi-sampling-rate speech recognition model, the method comprising:
receiving audio inputs of at least two different sampling rates; setting configuration information for feature extraction according to the sampling-rate class of the audio input; performing feature extraction on the audio using the configuration information to obtain audio features of at least two different sampling rates;
and training a neural network model with the audio features as input, wherein the audio features are annotated with a speech recognition label and a sampling-rate classification label.
2. The method of claim 1, wherein training the neural network model comprises:
normally training the neural network model with respect to the speech recognition label, and performing adversarial training on the neural network model with respect to the sampling-rate classification label.
3. The method of claim 2, wherein performing adversarial training on the neural network model with respect to the sampling-rate classification label comprises:
performing adversarial training on the neural network model with respect to the sampling-rate classification label according to a cross-entropy training criterion.
4. The method of claim 2 or 3, wherein the performing adversarial training comprises:
performing adversarial training by back-propagating the gradient after gradient inversion.
5. A method of multi-sampling-rate speech recognition, the method comprising:
receiving audio features;
and inputting the audio features into a multi-sampling-rate speech recognition model to obtain a speech recognition result, wherein the multi-sampling-rate speech recognition model is trained by performing the method of any one of claims 1 to 4.
6. An apparatus for training a multi-sampling-rate speech recognition model, the apparatus comprising:
an audio feature acquisition module, configured to receive audio inputs of at least two different sampling rates, set configuration information for feature extraction according to the sampling-rate class of the audio input, and perform feature extraction on the audio using the configuration information to obtain audio features of at least two different sampling rates;
and a neural network model training module, configured to train the neural network model with the audio features as input, wherein the audio features are annotated with speech recognition labels and sampling-rate classification labels.
7. A multi-sampling-rate speech recognition apparatus, the apparatus comprising:
an audio feature receiving module, configured to receive audio features;
and a speech recognition module, configured to input the audio features into a multi-sampling-rate speech recognition model to obtain a speech recognition result, wherein the multi-sampling-rate speech recognition model is trained by performing the method of any one of claims 1 to 4.
8. A multi-sampling rate speech recognition system comprising a processor and a memory, wherein the memory has stored therein computer program instructions for execution by the processor to perform the method of any of claims 1 to 4;
or to perform the method of claim 5.
9. A storage medium having stored thereon program instructions for performing, when executed, the method of any one of claims 1 to 4;
or, performing the method of claim 5.
Publications (2)

Publication Number Publication Date
CN111105786A CN111105786A (en) 2020-05-05
CN111105786B true CN111105786B (en) 2022-10-18
