
CN115359783A - Phoneme recognition method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115359783A
CN115359783A (application CN202210855299.3A; granted as CN115359783B)
Authority
CN
China
Prior art keywords
phoneme
recognition model
level
recognition
layer
Prior art date
Legal status
Granted
Application number
CN202210855299.3A
Other languages
Chinese (zh)
Other versions
CN115359783B (en)
Inventor
孙涛
申凯
万根顺
潘嘉
刘聪
胡国平
刘庆峰
胡郁
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202210855299.3A priority Critical patent/CN115359783B/en
Publication of CN115359783A publication Critical patent/CN115359783A/en
Application granted granted Critical
Publication of CN115359783B publication Critical patent/CN115359783B/en
Status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/54 Speech or voice analysis techniques specially adapted for retrieval

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

本发明提供一种音素识别方法、装置、电子设备和存储介质,所述方法包括:确定待识别语音;将待识别语音输入至音素识别模型,得到音素识别模型输出的音素识别结果;音素识别模型基于多个语种的样本语音及各样本语音的音素级标签,对第一识别模型进行训练得到,第一识别模型是基于第二识别模型下各音素节点所对应音素之间的相似度,对第二识别模型下的音素节点进行筛选得到的,第二识别模型包括多个语种分别对应的音素节点。本发明提供的音素识别方法、装置、电子设备和存储介质,不仅减小了音素识别模型的规模,而且音素识别模型能够准确对不同语种的音素进行区分。

Figure 202210855299

The present invention provides a phoneme recognition method, a device, electronic equipment, and a storage medium. The method includes: determining speech to be recognized; and inputting the speech to be recognized into a phoneme recognition model to obtain the phoneme recognition result output by the model. The phoneme recognition model is obtained by training a first recognition model on sample speech from multiple languages together with the phoneme-level labels of each sample. The first recognition model is obtained by screening the phoneme nodes of a second recognition model based on the similarity between the phonemes corresponding to those nodes; the second recognition model includes phoneme nodes for each of the multiple languages. The phoneme recognition method, device, electronic equipment and storage medium provided by the present invention not only reduce the scale of the phoneme recognition model but also enable it to accurately distinguish phonemes of different languages.


Description

音素识别方法、装置、电子设备和存储介质Phoneme recognition method, device, electronic device and storage medium

技术领域 Technical Field

本发明涉及语音识别技术领域,尤其涉及一种音素识别方法、装置、电子设备和存储介质。The invention relates to the technical field of speech recognition, in particular to a phoneme recognition method, device, electronic equipment and storage medium.

背景技术 Background Art

在语音识别领域中,音素作为语音中的最小的单位,若要提高语音识别的准确度,需要提高语音中每个音素的识别准确度。In the field of speech recognition, a phoneme is the smallest unit in speech. To improve the accuracy of speech recognition, it is necessary to improve the recognition accuracy of each phoneme in speech.

在实际应用场景中,语音对应有不同的语种,为了准确对不同语种的语音进行识别,目前多针对每种语种训练一个子模型,并基于这些子模型构建得到音素识别模型,以利用音素识别模型中的各子模型分别对各语种的语音进行音素识别,进而根据音素识别结果得到对应的语音识别结果。然而,随着语种种类的增加,子模型的个数也会增加,导致音素识别模型的规模也会增大,进而影响音素识别模型在本地芯片上的部署。In practical application scenarios, speech comes in different languages. To recognize speech in different languages accurately, current practice is to train one sub-model per language and build a phoneme recognition model from these sub-models, so that each sub-model performs phoneme recognition on speech of its language, and the corresponding speech recognition result is then derived from the phoneme recognition result. However, as the number of languages grows, the number of sub-models grows with it, and the phoneme recognition model becomes larger, which hinders its deployment on a local chip.

发明内容 Summary of the Invention

本发明提供一种音素识别方法、装置、电子设备和存储介质,用以解决现有技术中音素识别模型规模较大的缺陷。The invention provides a phoneme recognition method, device, electronic equipment and storage medium to solve the defect of large-scale phoneme recognition models in the prior art.

本发明提供一种音素识别方法,包括:The invention provides a phoneme recognition method, comprising:

确定待识别语音;Determine the voice to be recognized;

将所述待识别语音输入至音素识别模型,得到所述音素识别模型输出的音素识别结果;Inputting the speech to be recognized into a phoneme recognition model to obtain a phoneme recognition result output by the phoneme recognition model;

所述音素识别模型基于多个语种的样本语音及各样本语音的音素级标签,对第一识别模型进行训练得到,所述第一识别模型是基于第二识别模型下各音素节点所对应音素之间的相似度,对所述第二识别模型下的音素节点进行筛选得到的,所述第二识别模型包括多个语种分别对应的音素节点。The phoneme recognition model is obtained by training a first recognition model on sample speech from multiple languages together with the phoneme-level labels of each sample. The first recognition model is obtained by screening the phoneme nodes of a second recognition model based on the similarity between the phonemes corresponding to those nodes; the second recognition model includes phoneme nodes for each of the multiple languages.

根据本发明提供的一种音素识别方法,所述第一识别模型的确定步骤包括:According to a phoneme recognition method provided by the present invention, the step of determining the first recognition model includes:

基于各音素节点所对应音素之间的相似度,对所述第二识别模型下的各音素节点进行聚类,得到多个簇类;Based on the similarity between the phonemes corresponding to each phoneme node, each phoneme node under the second recognition model is clustered to obtain a plurality of clusters;

从各簇类中的音素节点筛选得到当前音素节点,并删除各簇类中除当前音素节点以外的其它音素节点,得到所述第一识别模型。A current phoneme node is selected by screening the phoneme nodes in each cluster, and the other phoneme nodes in each cluster are deleted, yielding the first recognition model.

根据本发明提供的一种音素识别方法,所述第二识别模型包括特征提取层和多个语种分别对应的音素分类层,各音素分类层基于各语种对应的音素节点构建得到;According to a phoneme recognition method provided by the present invention, the second recognition model includes a feature extraction layer and phoneme classification layers corresponding to multiple languages, and each phoneme classification layer is constructed based on phoneme nodes corresponding to each language;

所述第二识别模型基于如下步骤训练得到:The second recognition model is trained based on the following steps:

将各语种的样本语音输入至所述第二识别模型的特征提取层,得到所述第二识别模型的特征提取层输出的第一音素隐层特征;Input the sample speech of each language to the feature extraction layer of the second recognition model, and obtain the first phoneme hidden layer feature output by the feature extraction layer of the second recognition model;

将所述第一音素隐层特征输入至各语种的音素分类层,得到各语种的音素分类层输出的第一音素预测结果;The first phoneme hidden layer feature is input to the phoneme classification layer of each language to obtain the first phoneme prediction result output by the phoneme classification layer of each language;

基于所述音素级标签与所述第一音素预测结果之间的差异,对所述第二识别模型的特征提取层和各语种的音素分类层进行参数迭代,得到所述第二识别模型。Based on the difference between the phoneme-level label and the first phoneme prediction result, parameter iteration is performed on the feature extraction layer of the second recognition model and the phoneme classification layer of each language to obtain the second recognition model.
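The training procedure above can be sketched as a shared feature extraction layer feeding one phoneme classification layer per language, both updated from the phoneme-level labels. Everything below (the layer sizes, the language set, a single linear map standing in for the real encoder, plain SGD with cross-entropy) is a hypothetical minimal sketch, not the patent's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_HID, N_PHONEMES = 20, 16, 8          # hypothetical sizes
LANGS = ["zh", "en", "yue"]                  # hypothetical language set

# Shared feature extraction layer (one linear map stands in for the encoder).
W_enc = rng.normal(0.0, 0.1, (D_IN, D_HID))
# One phoneme classification layer per language, built from that language's phoneme nodes.
heads = {lang: rng.normal(0.0, 0.1, (D_HID, N_PHONEMES)) for lang in LANGS}

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def train_step(x, phoneme_label, lang, lr=0.1):
    """One SGD step: extract features, predict with the language's head, and
    update both that head and the shared extractor from the phoneme-level label."""
    h = x @ W_enc                            # first phoneme hidden-layer feature
    p = softmax(h @ heads[lang])             # first phoneme prediction result
    loss = -np.log(p[phoneme_label])
    dz = p.copy()
    dz[phoneme_label] -= 1.0                 # d(cross-entropy)/d(logits)
    g_head = np.outer(h, dz)                 # gradient for this language's head
    g_enc = np.outer(x, dz @ heads[lang].T)  # gradient for the shared extractor
    heads[lang] = heads[lang] - lr * g_head
    W_enc[...] -= lr * g_enc
    return loss
```

Repeated steps on labelled samples from all languages drive the parameter iteration the text describes; the extractor is shared while each language keeps its own output nodes.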

根据本发明提供的一种音素识别方法,所述得到所述第二识别模型的特征提取层输出的第一音素隐层特征,之后还包括:According to the phoneme recognition method provided by the present invention, after obtaining the first phoneme hidden-layer features output by the feature extraction layer of the second recognition model, the method further includes:

基于所述第一音素隐层特征,确定字级隐层特征和/或句级隐层特征;Based on the first phoneme hidden layer features, determine word-level hidden layer features and/or sentence-level hidden layer features;

基于所述样本语音的字级标签与字级预测结果之间的差异和/或所述样本语音的语种标签与语种预测结果之间的差异,对所述第二识别模型的特征提取层进行参数迭代,得到所述第二识别模型;所述字级预测结果基于所述字级隐层特征确定,所述语种预测结果基于所述句级隐层特征确定。Based on the difference between the word-level labels of the sample speech and the word-level prediction results, and/or the difference between the language label of the sample speech and the language prediction result, parameter iteration is performed on the feature extraction layer of the second recognition model to obtain the second recognition model; the word-level prediction results are determined from the word-level hidden-layer features, and the language prediction result is determined from the sentence-level hidden-layer features.

根据本发明提供的一种音素识别方法,所述基于所述样本语音的字级标签与字级预测结果之间的差异和/或所述样本语音的语种标签与语种预测结果之间的差异,对所述第二识别模型的特征提取层进行参数迭代,得到所述第二识别模型,包括:According to the phoneme recognition method provided by the present invention, performing parameter iteration on the feature extraction layer of the second recognition model based on the difference between the word-level labels of the sample speech and the word-level prediction results and/or the difference between the language label of the sample speech and the language prediction result, to obtain the second recognition model, includes:

将所述字级隐层特征输入至字级分类层,得到所述字级分类层输出的所述字级预测结果,和/或,将所述句级隐层特征输入至语种分类层,得到所述语种分类层输出的所述语种预测结果;inputting the word-level hidden-layer features into a word-level classification layer to obtain the word-level prediction results output by that layer, and/or inputting the sentence-level hidden-layer features into a language classification layer to obtain the language prediction result output by that layer;

基于所述字级标签与所述字级预测结果之间的差异和/或所述语种标签与所述语种预测结果之间的差异,对所述第二识别模型的特征提取层进行参数迭代,得到所述第二识别模型。performing parameter iteration on the feature extraction layer of the second recognition model based on the difference between the word-level labels and the word-level prediction results and/or the difference between the language label and the language prediction result, to obtain the second recognition model.
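The auxiliary word-level and language branches can be combined with the phoneme objective into one training loss. The plain softmax cross-entropy terms and the weights `w_word`/`w_lang` below are illustrative assumptions; the patent only states that the differences are used for parameter iteration, and the "and/or" is modelled by optional terms:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def joint_loss(phoneme_logits, phoneme_label,
               word_logits=None, word_label=None,
               lang_logits=None, lang_label=None,
               w_word=0.5, w_lang=0.5):
    """Cross-entropy on the phoneme prediction, plus optional word-level and
    language-classification terms; w_word and w_lang are hypothetical weights."""
    loss = -np.log(softmax(phoneme_logits)[phoneme_label])
    if word_logits is not None:
        loss += w_word * -np.log(softmax(word_logits)[word_label])
    if lang_logits is not None:
        loss += w_lang * -np.log(softmax(lang_logits)[lang_label])
    return loss
```

Backpropagating this joint loss through the shared extractor is one way the auxiliary differences can drive the feature extraction layer's parameter iteration.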

根据本发明提供的一种音素识别方法,所述基于所述第一音素隐层特征,确定字级隐层特征和/或句级隐层特征,包括:According to the phoneme recognition method provided by the present invention, determining word-level hidden-layer features and/or sentence-level hidden-layer features based on the first phoneme hidden-layer features includes:

对所述第一音素隐层特征进行滑窗,得到所述字级隐层特征;applying a sliding window to the first phoneme hidden-layer features to obtain the word-level hidden-layer features;

对所述字级隐层特征进行池化,得到所述句级隐层特征。The word-level hidden layer features are pooled to obtain the sentence-level hidden layer features.
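The two steps can be sketched directly: average the frames inside a sliding window for word-level features, then pool the word-level features into a sentence-level feature. The window size, hop, and mean-pooling are hypothetical choices; the patent does not fix them:

```python
import numpy as np

def word_level_features(phoneme_feats, win=5, hop=5):
    """Slide a window over the phoneme hidden-layer features (shape (T, D))
    and average each window into one word-level hidden-layer feature."""
    T, _ = phoneme_feats.shape
    starts = range(0, max(T - win + 1, 1), hop)
    return np.stack([phoneme_feats[s:s + win].mean(axis=0) for s in starts])

def sentence_level_feature(word_feats):
    """Pool the word-level hidden-layer features into one sentence-level feature."""
    return word_feats.mean(axis=0)
```

For 20 frames with a window and hop of 5, this yields 4 word-level vectors and a single sentence-level vector.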

根据本发明提供的一种音素识别方法,所述音素识别模型基于如下步骤训练得到:According to a method for phoneme recognition provided by the present invention, the phoneme recognition model is trained based on the following steps:

固定所述第一识别模型的特征提取层的参数;fixing the parameters of the feature extraction layer of the first recognition model;

将各语种的样本语音输入至所述第一识别模型的特征提取层,得到所述第一识别模型的特征提取层输出的第二音素隐层特征;Input the sample speech of each language to the feature extraction layer of the first recognition model, obtain the second phoneme hidden layer feature output by the feature extraction layer of the first recognition model;

将所述第二音素隐层特征输入至当前音素分类层,得到所述当前音素分类层输出的第二音素预测结果;所述当前音素分类层基于从所述第二识别模型中筛选得到的音素节点构建得到;inputting the second phoneme hidden-layer features into a current phoneme classification layer to obtain the second phoneme prediction result output by the current phoneme classification layer; the current phoneme classification layer is built from the phoneme nodes screened from the second recognition model;

基于所述音素级标签与所述第二音素预测结果之间的差异,对所述当前音素分类层进行参数迭代,得到所述音素识别模型。Based on the difference between the phoneme-level label and the second phoneme prediction result, parameter iteration is performed on the current phoneme classification layer to obtain the phoneme recognition model.
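A minimal sketch of this stage, with a linear map again standing in for the feature extraction layer: the extractor's parameters are fixed, and only the new, smaller classification layer built from the kept phoneme nodes is iterated. The sizes are hypothetical assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_HID, N_KEPT = 20, 16, 6                 # N_KEPT = phoneme nodes kept after screening
W_enc = rng.normal(0.0, 0.1, (D_IN, D_HID))     # feature extraction layer: parameters fixed
W_head = rng.normal(0.0, 0.1, (D_HID, N_KEPT))  # current phoneme classification layer

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def finetune_step(x, phoneme_label, lr=0.1):
    """Only the current phoneme classification layer receives a gradient update."""
    h = x @ W_enc                          # second phoneme hidden-layer feature
    p = softmax(h @ W_head)                # second phoneme prediction result
    dz = p.copy()
    dz[phoneme_label] -= 1.0
    W_head[...] -= lr * np.outer(h, dz)    # W_enc is deliberately left untouched
    return -np.log(p[phoneme_label])
```

Freezing the extractor keeps the representations learned with the full node set while the reduced output layer adapts to the kept nodes.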

本发明还提供一种音素识别装置,包括:The present invention also provides a phoneme recognition device, comprising:

确定单元,用于确定待识别语音;a determining unit, configured to determine the speech to be recognized;

识别单元,用于将待识别语音输入至音素识别模型,得到所述音素识别模型输出的音素识别结果;a recognition unit, configured to input the speech to be recognized into a phoneme recognition model and obtain the phoneme recognition result output by the phoneme recognition model;

所述音素识别模型基于多个语种的样本语音及各样本语音的音素级标签,对第一识别模型进行训练得到,所述第一识别模型是基于第二识别模型下各音素节点所对应音素之间的相似度,对所述第二识别模型下的音素节点进行筛选得到的,所述第二识别模型包括多个语种分别对应的音素节点。The phoneme recognition model is obtained by training a first recognition model on sample speech from multiple languages together with the phoneme-level labels of each sample. The first recognition model is obtained by screening the phoneme nodes of a second recognition model based on the similarity between the phonemes corresponding to those nodes; the second recognition model includes phoneme nodes for each of the multiple languages.

本发明还提供一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述程序时实现如上述任一种所述音素识别方法。The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, any one of the phoneme recognition methods described above is implemented.

本发明还提供一种非暂态计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现如上述任一种所述音素识别方法。The present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the phoneme recognition method described above can be implemented.

本发明还提供一种计算机程序产品,包括计算机程序,所述计算机程序被处理器执行时实现如上述任一种所述音素识别方法。The present invention also provides a computer program product, including a computer program. When the computer program is executed by a processor, any phoneme recognition method described above is implemented.

本发明提供的音素识别方法、装置、电子设备和存储介质,基于第二识别模型下各音素节点所对应音素之间的相似度,对第二识别模型下的音素节点进行筛选得到第一识别模型,不仅减小了第一识别模型的规模,而且在第一识别模型中保留了不同音素对应的音素节点,进而在基于多个语种的样本语音及各样本语音的音素级标签对第一识别模型进行训练后,不仅使得得到的音素识别模型的规模小于第二识别模型,而且音素识别模型能够准确对不同语种的音素进行区分。With the phoneme recognition method, device, electronic equipment and storage medium provided by the present invention, the phoneme nodes of the second recognition model are screened, based on the similarity between their corresponding phonemes, to obtain the first recognition model. This not only reduces the scale of the first recognition model but also retains phoneme nodes for the distinct phonemes, so that after the first recognition model is trained on sample speech from multiple languages with phoneme-level labels, the resulting phoneme recognition model is smaller than the second recognition model and can accurately distinguish phonemes of different languages.

附图说明 Brief Description of the Drawings

为了更清楚地说明本发明或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作一简单地介绍,显而易见地,下面描述中的附图是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。To explain the present invention or the technical solutions in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description illustrate some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

图1是本发明提供的音素识别方法的流程示意图;Fig. 1 is a schematic flow chart of the phoneme recognition method provided by the present invention;

图2是本发明提供的第一识别模型确定方法的流程示意图;Fig. 2 is a schematic flow chart of the method for determining the first identification model provided by the present invention;

图3是本发明提供的第二识别模型训练方法的流程示意图;Fig. 3 is a schematic flow chart of the second recognition model training method provided by the present invention;

图4是本发明提供的又一第二识别模型训练方法的流程示意图;Fig. 4 is a schematic flowchart of another second recognition model training method provided by the present invention;

图5是本发明提供的又一第二识别模型训练方法中步骤420的实施方式的流程示意图;FIG. 5 is a schematic flowchart of an implementation of Step 420 in yet another second recognition model training method provided by the present invention;

图6是本发明提供的音素识别模型训练方法的流程示意图;Fig. 6 is a schematic flow chart of the phoneme recognition model training method provided by the present invention;

图7是本发明提供的再一第二识别模型训练方法的流程示意图;Fig. 7 is a schematic flowchart of another second recognition model training method provided by the present invention;

图8是本发明提供的音素识别装置的结构示意图;Fig. 8 is a schematic structural diagram of a phoneme recognition device provided by the present invention;

图9是本发明提供的电子设备的结构示意图。FIG. 9 is a schematic structural diagram of an electronic device provided by the present invention.

具体实施方式 Detailed Description of the Embodiments

为使本发明的目的、技术方案和优点更加清楚,下面将结合本发明中的附图,对本发明中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。To make the purpose, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.

目前,在对不同语种进行语音识别时,多通过对每种语种训练一个子模型,并基于这些子模型构建得到音素识别模型,以利用音素识别模型中的各子模型分别对各语种的语音进行音素识别,进而根据音素识别结果得到对应的语音识别结果。然而,随着语种种类的增加,子模型的个数也会增加,导致音素识别模型的规模也会增大,进而影响音素识别模型在本地芯片上的部署。At present, when performing speech recognition across different languages, one sub-model is usually trained per language, and a phoneme recognition model is built from these sub-models; each sub-model then performs phoneme recognition on speech of its language, and the corresponding speech recognition result is derived from the phoneme recognition result. However, as the number of languages grows, the number of sub-models grows with it, the phoneme recognition model becomes larger, and its deployment on a local chip becomes harder.

此外,为了避免增大音素识别模型的规模,也有通过引入语种分类分支,然后在语种分类分支和主分支之间插入一个梯度反转层,以通过梯度对抗训练使音素识别模型学习到语种不变特征,但该方法适用于差异不大的语种,对于差异较大的语种(如粤语,闽南语等与普通话差异较大)识别效果较差。In addition, to avoid enlarging the phoneme recognition model, some approaches introduce a language classification branch and insert a gradient reversal layer between that branch and the main branch, so that adversarial gradient training makes the phoneme recognition model learn language-invariant features. However, this method suits languages that differ little; for languages that differ substantially (e.g. Cantonese and Hokkien differ considerably from Mandarin), its recognition performance is poor.

对此,本发明提供一种音素识别方法。图1是本发明提供的音素识别方法的流程示意图,如图1所示,该方法包括如下步骤:For this, the present invention provides a phoneme recognition method. Fig. 1 is a schematic flow chart of the phoneme recognition method provided by the present invention, as shown in Fig. 1, the method comprises the following steps:

步骤110、确定待识别语音。Step 110, determine the speech to be recognized.

此处,待识别语音即需要进行音素识别的语音。待识别语音可以通过拾音设备得到,此处拾音设备可以是智能手机、平板电脑,还可以是智能电器例如音响、电视和空调等,拾音设备在经过麦克风阵列拾音得到待识别语音后,还可以对待识别语音进行放大和降噪,本发明实施例对此不作具体限定。Here, the speech to be recognized is the speech on which phoneme recognition is to be performed. It may be captured by a sound pickup device such as a smartphone or tablet, or a smart appliance such as a speaker, television or air conditioner. After capturing the speech through a microphone array, the pickup device may also amplify and denoise it; this embodiment of the present invention does not specifically limit this.

步骤120、将待识别语音输入至音素识别模型,得到音素识别模型输出的音素识别结果;Step 120, input the speech to be recognized into the phoneme recognition model, and obtain the phoneme recognition result output by the phoneme recognition model;

音素识别模型基于多个语种的样本语音及各样本语音的音素级标签,对第一识别模型进行训练得到,第一识别模型是基于第二识别模型下各音素节点所对应音素之间的相似度,对第二识别模型下的音素节点进行筛选得到的,第二识别模型包括多个语种分别对应的音素节点。The phoneme recognition model is obtained by training a first recognition model on sample speech from multiple languages together with the phoneme-level labels of each sample. The first recognition model is obtained by screening the phoneme nodes of a second recognition model based on the similarity between the phonemes corresponding to those nodes; the second recognition model includes phoneme nodes for each of the multiple languages.

此处,各语种的音素节点分别对应各语种不同的音素,如普通话中“a”和“i”是不同的音素,从而普通话中“a”和“i”对应不同的音素节点。第二识别模型包括多个语种分别对应的音素节点,从而第二识别模型能够通过各语种对应的音素节点对不同语种下的音素进行区分,也就是第二识别模型具备准确对不同语种的音素进行区分能力。Here, the phoneme nodes of each language correspond to that language's distinct phonemes; for example, "a" and "i" are different phonemes in Mandarin, so "a" and "i" correspond to different phoneme nodes. The second recognition model includes phoneme nodes for each of the multiple languages, so it can distinguish phonemes across languages through those nodes; that is, the second recognition model is capable of accurately distinguishing phonemes of different languages.

然而,若存在大量不同类别的语种,则会导致第二识别模型中的音素节点过多,进而导致第二识别模型的计算参数量较大。此外,考虑到同类语种中可能存在相似度较高的音素,不同类语种间也可能存在相似度较高的音素,也就是第二识别模型中各语种分别对应的音素节点可能存在冗余。However, with a large number of different languages, the second recognition model would contain too many phoneme nodes and hence a large number of parameters. Moreover, highly similar phonemes may exist both within one language and across languages, so the phoneme nodes of the individual languages in the second recognition model may be redundant.

对此,本发明实施例基于第二识别模型下各音素节点所对应音素之间的相似度,对第二识别模型下的音素节点进行筛选,如基于音素之间的相似度,对各音素节点进行聚类,将相似度较高的音素对应的音素节点聚为一类,然后选取同一类中的任一音素节点作为第一识别模型的当前音素节点,删除该类中的其余音素节点,从而能够减少第一识别模型中音素节点的数量,以减小第一识别模型的规模,相应地也减小了音素识别模型的规模。For this reason, the embodiment of the present invention screens the phoneme nodes of the second recognition model based on the similarity between the phonemes corresponding to those nodes. For example, the phoneme nodes are clustered by phoneme similarity, so that nodes corresponding to highly similar phonemes fall into one cluster; any one node of a cluster is then selected as a current phoneme node of the first recognition model, and the remaining nodes of that cluster are deleted. This reduces the number of phoneme nodes in the first recognition model, shrinking its scale and, correspondingly, the scale of the phoneme recognition model.
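Concretely, the screening just described amounts to dropping rows from the output layer. A sketch, under the assumptions that each phoneme node is one row of an output-layer weight matrix and that the first member of each cluster is kept (any member would do, per the text):

```python
import numpy as np

def prune_phoneme_nodes(W_out, cluster_ids):
    """W_out: (n_nodes, D) output-layer weights of the second recognition model,
    one row per phoneme node. cluster_ids: one cluster assignment per node.
    Returns the smaller output layer of the first recognition model and the
    indices of the kept (current) phoneme nodes."""
    kept = [int(np.flatnonzero(cluster_ids == c)[0]) for c in np.unique(cluster_ids)]
    return W_out[kept], kept
```

With six nodes in three clusters, the pruned layer has exactly three rows: one surviving node per cluster.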

在对第二识别模型下的音素节点进行筛选后,滤除了第二识别模型下的冗余音素节点,即滤除对应有相似音素的音素节点,也就是第一识别模型中包含的当前音素节点同样对应有不同类别的音素,从而在基于多个语种的样本语音及各样本语音的音素级标签,对第一识别模型进行训练后,得到的音素识别模型能够准确对不同语种的音素进行区分。Screening the phoneme nodes of the second recognition model filters out its redundant nodes, i.e. nodes corresponding to similar phonemes, while the current phoneme nodes retained in the first recognition model still cover the distinct phoneme categories. Therefore, after the first recognition model is trained on sample speech from multiple languages with phoneme-level labels, the resulting phoneme recognition model can accurately distinguish phonemes of different languages.

本发明实施例提供的音素识别方法,基于第二识别模型下各音素节点所对应音素之间的相似度,对第二识别模型下的音素节点进行筛选得到第一识别模型,不仅减小了第一识别模型的规模,而且在第一识别模型中保留了不同音素对应的音素节点,进而在基于多个语种的样本语音及各样本语音的音素级标签对第一识别模型进行训练后,不仅使得得到的音素识别模型的规模小于第二识别模型,而且音素识别模型能够准确对不同语种的音素进行区分。In the phoneme recognition method provided by this embodiment, the phoneme nodes of the second recognition model are screened, based on the similarity between their corresponding phonemes, to obtain the first recognition model. This not only reduces the scale of the first recognition model but also retains phoneme nodes for the distinct phonemes. After the first recognition model is trained on sample speech from multiple languages with phoneme-level labels, the resulting phoneme recognition model is smaller than the second recognition model and can still accurately distinguish phonemes of different languages.

基于上述实施例,图2是本发明提供的第一识别模型确定方法的流程示意图,如图2所示,第一识别模型的确定步骤包括:Based on the above-mentioned embodiment, Fig. 2 is a schematic flowchart of the method for determining the first recognition model provided by the present invention. As shown in Fig. 2, the steps of determining the first recognition model include:

步骤210、基于各音素节点所对应音素之间的相似度,对第二识别模型下的各音素节点进行聚类,得到多个簇类;Step 210, based on the similarity between the phonemes corresponding to each phoneme node, cluster each phoneme node under the second recognition model to obtain a plurality of clusters;

步骤220、从各簇类中的音素节点筛选得到当前音素节点,并删除各簇类中除当前音素节点以外的其它音素节点,得到第一识别模型。Step 220 : Screen the phoneme nodes in each cluster to obtain the current phoneme node, and delete other phoneme nodes in each cluster except the current phoneme node to obtain the first recognition model.

具体地,音素之间的相似度用于表征对应两个音素属于同一类别的概率,音素之间的相似度越高,表明对应两个音素属于同一类别的概率越高;反之,音素之间的相似度越低,表明对应两个音素属于同一类别的概率越低。Specifically, the similarity between two phonemes characterizes the probability that they belong to the same category: the higher the similarity, the higher that probability; conversely, the lower the similarity between two phonemes, the lower the probability that they belong to the same category.

基于各音素节点所对应音素之间的相似度,将对应相同类别音素的音素节点聚为一类,得到多个簇类,即各簇类中包含的音素节点对应的音素类别相同或相似。Based on the similarity between the phonemes corresponding to each phoneme node, the phoneme nodes corresponding to the same type of phoneme are clustered into one class, and multiple clusters are obtained, that is, the phoneme classes corresponding to the phoneme nodes contained in each cluster class are the same or similar.

此外,当前音素节点指第一识别模型的音素节点,不同簇类中当前音素节点对应的音素类别不同。当前音素节点在得到多个簇类后,从各簇类中的音素节点筛选得到的,当前音素节点可以为一个,也可以为多个,但各簇类中当前音素节点的数量总和小于第二识别模型中音素节点的数量总和。In addition, the current phoneme nodes are the phoneme nodes of the first recognition model, and the current phoneme nodes of different clusters correspond to different phoneme categories. After the clusters are obtained, the current phoneme nodes are selected from the phoneme nodes of each cluster; there may be one or more per cluster, but the total number of current phoneme nodes across all clusters is smaller than the total number of phoneme nodes in the second recognition model.

可选地,可以将各簇类中的任意一个音素节点或多个音素节点作为当前音素节点,也可以将与各簇类中心的距离小于阈值的音素节点作为当前音素节点,还可以将距离各簇类中心最近的音素节点作为当前音素节点,本发明实施例对此不作具体限定。Optionally, any one or more phoneme nodes of a cluster may serve as the current phoneme node; alternatively, a phoneme node whose distance to the cluster centre is below a threshold, or the phoneme node closest to the cluster centre, may be chosen. This embodiment of the present invention does not specifically limit this.
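The "closest to the cluster centre" option can be sketched as follows, assuming each phoneme node has an embedding vector and taking the cluster mean as the centre with Euclidean distance (a hypothetical but common choice):

```python
import numpy as np

def nearest_to_centre(node_emb, cluster_ids):
    """For each cluster, return the index of the phoneme node whose embedding
    lies closest to the cluster centre (the mean of the cluster's embeddings)."""
    chosen = []
    for c in np.unique(cluster_ids):
        idx = np.flatnonzero(cluster_ids == c)
        centre = node_emb[idx].mean(axis=0)
        dists = np.linalg.norm(node_emb[idx] - centre, axis=1)
        chosen.append(int(idx[np.argmin(dists)]))
    return chosen
```

The threshold variant would instead keep every node whose distance to the centre is below a cutoff, possibly yielding several current phoneme nodes per cluster.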

在确定各簇类中的当前音素节点后,各簇类中的其它音素节点与当前音素节点的类别相同或相似,也就是其它音素节点可以看作是冗余音素节点。对此,本发明实施例在得到各簇类的当前音素节点后,删除各簇类中除当前音素节点以外的其它音素节点,得到第一识别模型,即可以理解为第一识别模型是从第二识别模型中删除了冗余音素节点后得到的模型。After the current phoneme node of each cluster is determined, the other phoneme nodes in the cluster belong to the same or a similar category as the current phoneme node, i.e. they can be regarded as redundant phoneme nodes. Accordingly, after obtaining the current phoneme nodes of the clusters, this embodiment deletes the other phoneme nodes in each cluster to obtain the first recognition model; the first recognition model can thus be understood as the second recognition model with its redundant phoneme nodes deleted.

由此可见，本发明实施例基于各音素节点所对应音素之间的相似度，可以准确对第二识别模型下的各音素节点进行聚类，进而得到多个簇类。同时，本发明实施例从各簇类中的音素节点筛选得到当前音素节点，并删除各簇类中除当前音素节点以外的其它音素节点，从而能够减少第一识别模型中音素节点的数量，实现减小第一识别模型的规模，进而相应减小了音素识别模型的规模。It can be seen that, based on the similarity between the phonemes corresponding to the phoneme nodes, the embodiment of the present invention can accurately cluster the phoneme nodes of the second recognition model to obtain multiple clusters. Meanwhile, the embodiment of the present invention selects the current phoneme node from the phoneme nodes in each cluster and deletes the other phoneme nodes in each cluster except the current phoneme node, thereby reducing the number of phoneme nodes in the first recognition model and hence the scale of the first recognition model, which in turn reduces the scale of the phoneme recognition model.

作为一种可选实施例，在对第二识别模型下的各音素节点进行聚类时，可以通过高斯混合模型(Gaussian Mixture Model，GMM模型)和期望最大化算法(Expectation-Maximization，EM)对各音素节点对应的音素进行聚类。As an optional embodiment, when clustering the phoneme nodes of the second recognition model, a Gaussian Mixture Model (GMM) and the Expectation-Maximization (EM) algorithm can be used to cluster the phonemes corresponding to the phoneme nodes.

例如，第二识别模型包括Na个语种对应的音素节点，且每个语种对应的音素节点个数为Nc，也就是第二识别模型包含的音素节点总数为Na×Nc个，若要使得第一识别模型中音素节点总数为Nc个，则可以通过GMM模型对各音素节点进行聚类，然后根据聚类结果不断迭代调整GMM模型的参数，直至聚类结果为将第二识别模型中的各音素节点划分为Nc个簇类。For example, the second recognition model includes phoneme nodes corresponding to Na languages, and each language corresponds to Nc phoneme nodes, so the total number of phoneme nodes in the second recognition model is Na×Nc. To make the total number of phoneme nodes in the first recognition model Nc, the phoneme nodes can be clustered with the GMM, and the parameters of the GMM are then iteratively adjusted according to the clustering results until the phoneme nodes of the second recognition model are divided into Nc clusters.
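A minimal sketch of this clustering step, using scikit-learn's `GaussianMixture` (which runs EM internally when fitted). The phoneme-node embeddings here are random stand-ins; in a real system they would come from the second recognition model, for example its classification-layer weight vectors — an assumption for illustration, not something this disclosure specifies.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical setup: Na languages x Nc phonemes per language, with each
# phoneme node represented by an embedding vector (random stand-in here).
Na, Nc, dim = 3, 8, 16
rng = np.random.default_rng(0)
node_embeddings = rng.normal(size=(Na * Nc, dim))

# Fit a GMM with Nc components; fit_predict runs EM and assigns each
# phoneme node to one of the Nc clusters.
gmm = GaussianMixture(n_components=Nc, covariance_type="diag",
                      max_iter=200, random_state=0)
cluster_ids = gmm.fit_predict(node_embeddings)   # one cluster id per node
print(cluster_ids.shape)                          # (24,) = (Na * Nc,)
```

After this step, one representative node per cluster would be kept and the rest deleted, shrinking the Na×Nc nodes down to Nc.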

需要说明的是,本发明实施例还可以根据实际需求将第二识别模型中的各音素节点划分为其它数量的簇类,本发明实施例对此不作具体限定。It should be noted that, in the embodiment of the present invention, each phoneme node in the second recognition model can also be divided into other numbers of clusters according to actual requirements, which is not specifically limited in the embodiment of the present invention.

基于上述任一实施例,第二识别模型包括特征提取层和多个语种分别对应的音素分类层,各音素分类层基于各语种对应的音素节点构建得到。图3是本发明提供的第二识别模型训练方法的流程示意图,如图3所示,第二识别模型的训练步骤包括:Based on any of the above embodiments, the second recognition model includes a feature extraction layer and phoneme classification layers corresponding to multiple languages, and each phoneme classification layer is constructed based on phoneme nodes corresponding to each language. Fig. 3 is a schematic flow chart of the second recognition model training method provided by the present invention. As shown in Fig. 3, the training steps of the second recognition model include:

步骤310、将各语种的样本语音输入至第二识别模型的特征提取层,得到第二识别模型的特征提取层输出的第一音素隐层特征;Step 310, input the sample speech of each language to the feature extraction layer of the second recognition model, and obtain the first phoneme hidden layer feature output by the feature extraction layer of the second recognition model;

步骤320、将第一音素隐层特征输入至各语种的音素分类层,得到各语种的音素分类层输出的第一音素预测结果;Step 320, input the first phoneme hidden layer feature to the phoneme classification layer of each language, and obtain the first phoneme prediction result output by the phoneme classification layer of each language;

步骤330、基于音素级标签与第一音素预测结果之间的差异,对第二识别模型的特征提取层和各语种的音素分类层进行参数迭代,得到第二识别模型。Step 330 : Based on the difference between the phoneme-level label and the first phoneme prediction result, perform parameter iteration on the feature extraction layer of the second recognition model and the phoneme classification layer of each language to obtain the second recognition model.
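Steps 310 to 330 can be sketched as follows. This is a minimal PyTorch sketch under assumed shapes: the GRU extractor, hidden sizes, and per-language phoneme counts are illustrative assumptions, not the architecture of this disclosure.

```python
import torch
import torch.nn as nn

class SecondModel(nn.Module):
    """Shared feature extractor + one independent phoneme
    classification head per language."""
    def __init__(self, feat_dim=40, hidden=64, phonemes_per_lang=(10, 12)):
        super().__init__()
        self.extractor = nn.GRU(feat_dim, hidden, batch_first=True)
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n) for n in phonemes_per_lang])

    def forward(self, speech, lang_id):
        hidden_feats, _ = self.extractor(speech)  # first phoneme hidden features
        return self.heads[lang_id](hidden_feats)  # per-language phoneme logits

model = SecondModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

speech = torch.randn(2, 50, 40)           # (batch, frames, feat_dim), language 0
labels = torch.randint(0, 10, (2, 50))    # frame-level phoneme labels

logits = model(speech, lang_id=0)         # step 320: first phoneme prediction
loss = nn.functional.cross_entropy(       # step 330: label/prediction difference
    logits.reshape(-1, 10), labels.reshape(-1))
loss.backward()
opt.step()                                # one parameter-iteration step
print(logits.shape)                       # torch.Size([2, 50, 10])
```

The extractor is shared across languages; only the head indexed by `lang_id` is used for a given sample, so each head learns its own language's phoneme inventory.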

具体地，第一音素隐层特征用于表征样本语音中各音素的特征信息，其可以理解为帧级隐层特征。第二识别模型的特征提取层用于提取各语种的样本语音对应的第一音素隐层特征。其中，第二识别模型中的特征提取层是共享的，也就是各语种的样本语音均可由该特征提取层进行特征提取。此外，第二识别模型的特征提取层可以采用DNN(Deep Neural Network，深度神经网络)、RNN(Recurrent Neural Network，循环神经网络)或者CNN(Convolutional Neural Network，卷积神经网络)等神经网络模型提取第一音素隐层特征，本发明实施例对此不作具体限定。Specifically, the first phoneme hidden-layer features represent the feature information of each phoneme in the sample speech and can be understood as frame-level hidden-layer features. The feature extraction layer of the second recognition model is used to extract the first phoneme hidden-layer features of the sample speeches of each language. The feature extraction layer of the second recognition model is shared; that is, the sample speeches of all languages are processed by this same layer. In addition, the feature extraction layer of the second recognition model may use a neural network model such as a DNN (Deep Neural Network), an RNN (Recurrent Neural Network), or a CNN (Convolutional Neural Network) to extract the first phoneme hidden-layer features, which is not specifically limited in this embodiment of the present invention.

此外，第二识别模型还包括多个语种分别对应的音素分类层，也就是各语种对应的音素分类层是相互独立的，从而各语种对应的音素分类层可以独立学习对应语种的音素信息，避免不同语种之间发音冲突对音素识别的影响，进而准确对该语种下的音素进行识别，得到第一音素预测结果。In addition, the second recognition model further includes a phoneme classification layer for each of the multiple languages; that is, the phoneme classification layers of the languages are independent of one another, so each language's phoneme classification layer can independently learn the phoneme information of its own language. This avoids the impact of pronunciation conflicts between different languages on phoneme recognition, so that the phonemes of each language can be accurately recognized to obtain the first phoneme prediction result.

在得到第一音素预测结果之后，基于音素级标签与第一音素预测结果之间的差异，对第二识别模型的特征提取层和各语种的音素分类层进行参数迭代，使得第二识别模型在训练过程中能够尽量学习各语种下不同类别音素的信息，从而使得第二识别模型能够准确识别各语种下的音素。After the first phoneme prediction result is obtained, parameter iteration is performed on the feature extraction layer of the second recognition model and the phoneme classification layer of each language based on the difference between the phoneme-level labels and the first phoneme prediction result, so that during training the second recognition model learns as much information as possible about the different categories of phonemes in each language and can therefore accurately recognize the phonemes of each language.

由此可见，本发明实施例基于音素级标签与第一音素预测结果之间的差异，对第二识别模型的特征提取层和各语种的音素分类层进行参数迭代，能够使得训练得到的第二识别模型准确识别各语种下的音素。It can be seen that the embodiment of the present invention performs parameter iteration on the feature extraction layer of the second recognition model and the phoneme classification layer of each language based on the difference between the phoneme-level labels and the first phoneme prediction result, so that the trained second recognition model can accurately recognize the phonemes of each language.

作为一种可选实施例，第二识别模型中的特征提取层可以包括第一编码层和第一注意力层，第一编码层用于对各语种的样本语音进行编码，得到各样本语音的第一编码特征，第一注意力层用于基于注意力机制，对各样本语音的第一编码特征进行注意力变换，得到第一音素隐层特征。此外，第二识别模型中各语种的音素分类层可以包括第一解码层和第一识别层，第一解码层用于对第一音素隐层特征进行解码，得到各样本语音的第一解码特征，第一识别层用于基于各样本语音的第一解码特征进行音素识别，得到第一音素预测结果。As an optional embodiment, the feature extraction layer of the second recognition model may include a first encoding layer and a first attention layer. The first encoding layer encodes the sample speeches of each language to obtain the first encoding features of each sample speech, and the first attention layer applies an attention mechanism to the first encoding features to obtain the first phoneme hidden-layer features. In addition, the phoneme classification layer of each language in the second recognition model may include a first decoding layer and a first recognition layer. The first decoding layer decodes the first phoneme hidden-layer features to obtain the first decoding features of each sample speech, and the first recognition layer performs phoneme recognition based on the first decoding features to obtain the first phoneme prediction result.

基于上述任一实施例,图4是本发明提供的又一第二识别模型训练方法的流程示意图,如图4所示,第二识别模型的训练步骤包括:Based on any of the above-mentioned embodiments, FIG. 4 is a schematic flowchart of another second recognition model training method provided by the present invention. As shown in FIG. 4, the training steps of the second recognition model include:

步骤410、得到第二识别模型的特征提取层输出的第一音素隐层特征之后,基于第一音素隐层特征,确定字级隐层特征和/或句级隐层特征;Step 410: After obtaining the first phoneme hidden layer features output by the feature extraction layer of the second recognition model, determine word-level hidden layer features and/or sentence-level hidden layer features based on the first phoneme hidden layer features;

步骤420、基于样本语音的字级标签与字级预测结果之间的差异和/或样本语音的语种标签与语种预测结果之间的差异，对第二识别模型的特征提取层进行参数迭代，得到第二识别模型；字级预测结果基于字级隐层特征确定，语种预测结果基于句级隐层特征确定。Step 420: Based on the difference between the word-level labels of the sample speech and the word-level prediction results and/or the difference between the language labels of the sample speech and the language prediction results, perform parameter iteration on the feature extraction layer of the second recognition model to obtain the second recognition model; the word-level prediction results are determined based on the word-level hidden-layer features, and the language prediction results are determined based on the sentence-level hidden-layer features.

具体地，字级隐层特征用于表征样本语音中各字符的特征信息，由于各字符是基于多个音素构建得到的，从而在确定字级隐层特征时，需要基于字符对应的多个第一音素隐层特征确定。句级隐层特征用于表征样本语音中各分句的特征信息，由于各分句是基于多个字符构建得到的，从而在确定句级隐层特征时，需要基于分句对应的多个字级隐层特征确定。Specifically, the word-level hidden-layer features represent the feature information of each character in the sample speech. Since each character is built from multiple phonemes, the word-level hidden-layer features need to be determined based on the multiple first phoneme hidden-layer features corresponding to the character. The sentence-level hidden-layer features represent the feature information of each clause in the sample speech. Since each clause is built from multiple characters, the sentence-level hidden-layer features need to be determined based on the multiple word-level hidden-layer features corresponding to the clause.

基于样本语音的字级标签与字级预测结果之间的差异，对第二识别模型的特征提取层进行参数迭代时，可以使得第二识别模型的特征提取层从字级层面学习各语种下的不同音素信息，进而能够从字级层面准确识别不同音素。When parameter iteration is performed on the feature extraction layer of the second recognition model based on the difference between the word-level labels of the sample speech and the word-level prediction results, the feature extraction layer can learn the different phoneme information of each language at the word level, and can thus accurately distinguish different phonemes at the word level.

基于样本语音的语种标签与语种预测结果之间的差异，对第二识别模型的特征提取层进行参数迭代时，可以使得第二识别模型的特征提取层从句子级层面学习各语种下的不同音素信息，进而能够从句子级层面准确识别不同音素。When parameter iteration is performed on the feature extraction layer of the second recognition model based on the difference between the language labels of the sample speech and the language prediction results, the feature extraction layer can learn the different phoneme information of each language at the sentence level, and can thus accurately distinguish different phonemes at the sentence level.

由此可见，本发明实施例基于样本语音的字级标签与字级预测结果之间的差异和/或样本语音的语种标签与语种预测结果之间的差异，对第二识别模型的特征提取层进行参数迭代，可以使得第二识别模型还能够从颗粒度较大的字级和/或句子级层面准确识别不同音素，进一步提高第二识别模型的音素识别效果。It can be seen that, by performing parameter iteration on the feature extraction layer of the second recognition model based on the difference between the word-level labels of the sample speech and the word-level prediction results and/or the difference between the language labels of the sample speech and the language prediction results, the embodiment of the present invention enables the second recognition model to also accurately distinguish different phonemes at the coarser-grained word and/or sentence level, further improving its phoneme recognition performance.

基于上述任一实施例,图5是本发明提供的又一第二识别模型训练方法中步骤420的实施方式的流程示意图,如图5所示,步骤420包括:Based on any of the above-mentioned embodiments, FIG. 5 is a schematic flowchart of an implementation of step 420 in yet another second recognition model training method provided by the present invention. As shown in FIG. 5, step 420 includes:

步骤421、将字级隐层特征输入至字级分类层,得到字级分类层输出的字级预测结果,和/或,将句级隐层特征输入至语种分类层,得到语种分类层输出的语种预测结果;Step 421: Input the word-level hidden layer features into the word-level classification layer to obtain the word-level prediction results output by the word-level classification layer, and/or input the sentence-level hidden layer features into the language classification layer to obtain the output of the language classification layer Language prediction results;

步骤422、基于字级标签与字级预测结果之间的差异和/或语种标签与语种预测结果之间的差异,对第二识别模型的特征提取层进行参数迭代,得到第二识别模型。Step 422 : Based on the difference between the word-level label and the word-level prediction result and/or the difference between the language label and the language prediction result, perform parameter iteration on the feature extraction layer of the second recognition model to obtain the second recognition model.

具体地,字级分类层用于基于字级隐层特征确定字级预测结果,语种分类层用于基于句级隐层特征确定语种预测结果。其中,字级预测结果可以理解为样本语音中各字符的预测结果,语种预测结果可以理解为样本语音中各分句的语种预测结果。Specifically, the word-level classification layer is used to determine the word-level prediction result based on the word-level hidden layer features, and the language classification layer is used to determine the language type prediction result based on the sentence-level hidden layer features. Wherein, the word-level prediction result can be understood as the prediction result of each character in the sample speech, and the language prediction result can be understood as the language prediction result of each clause in the sample speech.

可选地，本发明实施例可以基于样本语音的字级标签与字级预测结果之间的差异，对第二识别模型的特征提取层进行参数迭代，从而使得第二识别模型的特征提取层能够从字级层面学习各语种下的不同音素信息，进而能够从字级层面准确识别不同音素。Optionally, in this embodiment of the present invention, parameter iteration may be performed on the feature extraction layer of the second recognition model based on the difference between the word-level labels of the sample speech and the word-level prediction results, so that the feature extraction layer can learn the different phoneme information of each language at the word level and thus accurately distinguish different phonemes at the word level.

可选地，本发明实施例可以基于样本语音的语种标签与语种预测结果之间的差异，对第二识别模型的特征提取层进行参数迭代，从而使得第二识别模型的特征提取层能够从句子级层面学习各语种下的不同音素信息，进而能够从句子级层面准确识别不同音素。Optionally, in this embodiment of the present invention, parameter iteration may be performed on the feature extraction layer of the second recognition model based on the difference between the language labels of the sample speech and the language prediction results, so that the feature extraction layer can learn the different phoneme information of each language at the sentence level and thus accurately distinguish different phonemes at the sentence level.

可选地，本发明实施例可以基于样本语音的字级标签与字级预测结果之间的差异和样本语音的语种标签与语种预测结果之间的差异，对第二识别模型的特征提取层进行参数迭代，从而使得第二识别模型的特征提取层能够从字级和句子级层面学习各语种下的不同音素信息，进而能够从字级和句子级层面准确识别不同音素。Optionally, in this embodiment of the present invention, parameter iteration may be performed on the feature extraction layer of the second recognition model based on both the difference between the word-level labels of the sample speech and the word-level prediction results and the difference between the language labels of the sample speech and the language prediction results, so that the feature extraction layer can learn the different phoneme information of each language at both the word level and the sentence level and thus accurately distinguish different phonemes at both levels.
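The optional combinations above can be sketched as a single weighted auxiliary loss. This is a hedged sketch: the loss weights `alpha` and `beta` and all tensor shapes are illustrative assumptions, not values given in this disclosure.

```python
import torch
import torch.nn.functional as F

def auxiliary_loss(word_logits, word_labels, lang_logits, lang_labels,
                   alpha=1.0, beta=0.5):
    """Combine the word-level and language-level differences used to
    iterate the feature extraction layer's parameters."""
    word_loss = F.cross_entropy(word_logits, word_labels)  # word-level difference
    lang_loss = F.cross_entropy(lang_logits, lang_labels)  # language-level difference
    return alpha * word_loss + beta * lang_loss

# Toy tensors: 4 windows, 20 character classes, 3 languages.
word_logits = torch.randn(4, 20)
word_labels = torch.randint(0, 20, (4,))
lang_logits = torch.randn(4, 3)
lang_labels = torch.randint(0, 3, (4,))
loss = auxiliary_loss(word_logits, word_labels, lang_logits, lang_labels)
print(loss.item() > 0)  # True
```

Setting `beta=0` recovers the word-level-only variant and `alpha=0` the language-level-only variant described above.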

需要说明的是，字级分类层和句级分类层可以设置于辅助模型中，也就是第二识别模型中不包含字级分类层和句级分类层，该辅助模型用于从颗粒度较大的字级和/或句子级层面辅助训练第二识别模型，使得第二识别模型能够进一步从字级和/或句子级层面准确识别不同音素，提高音素识别效果。It should be noted that the word-level classification layer and the sentence-level classification layer may be placed in an auxiliary model; that is, the second recognition model itself does not contain them. The auxiliary model is used to assist the training of the second recognition model at the coarser-grained word and/or sentence level, so that the second recognition model can further accurately distinguish different phonemes at these levels, improving the phoneme recognition performance.

作为一种可选实施例,辅助模型可以包括字级特征提取层、字级分类层、句级特征提取层和语种分类层。其中,字级特征提取层用于对第一音素隐层特征进行滑窗,得到字级隐层特征。字级分类层用于基于字级隐层特征进行字符识别,得到字级预测结果。句级特征提取层用于对字级特征提取层输出的字级隐层特征进行池化,得到句级隐层特征。语种分类层用于基于句级隐层特征进行语种识别,得到语种预测结果。As an optional embodiment, the auxiliary model may include a word-level feature extraction layer, a word-level classification layer, a sentence-level feature extraction layer, and a language classification layer. Among them, the word-level feature extraction layer is used to perform a sliding window on the first phoneme hidden layer feature to obtain the word-level hidden layer feature. The word-level classification layer is used to perform character recognition based on word-level hidden layer features, and obtain word-level prediction results. The sentence-level feature extraction layer is used to pool the word-level hidden layer features output by the word-level feature extraction layer to obtain sentence-level hidden layer features. The language classification layer is used for language identification based on the sentence-level hidden layer features to obtain language prediction results.

基于上述任一实施例,基于第一音素隐层特征,确定字级隐层特征和/或句级隐层特征,包括:Based on any of the above-mentioned embodiments, based on the first phoneme hidden layer features, determine word-level hidden layer features and/or sentence-level hidden layer features, including:

对第一音素隐层特征进行滑窗,得到字级隐层特征;Perform a sliding window on the hidden layer features of the first phoneme to obtain word-level hidden layer features;

对字级隐层特征进行池化,得到句级隐层特征。The word-level hidden layer features are pooled to obtain sentence-level hidden layer features.

具体地，由于基于字级隐层特征进行字符识别的颗粒度大于基于第一音素隐层特征进行音素识别的颗粒度，因此需要对第一音素隐层特征进行滑窗操作，如可以设定窗长为B，每次取B帧第一音素隐层特征送入神经网络，经过神经网络抽象出字级隐层特征后，再将字级隐层特征输入字级分类层，得到字级预测结果。基于句级隐层特征进行语种识别的颗粒度相比于基于字级隐层特征进行字符识别的颗粒度更大，因此可以将字级隐层特征通过神经网络的多次池化生成句级隐层特征，并将句级隐层特征输入至语种分类层，得到语种预测结果。Specifically, since character recognition based on word-level hidden-layer features is of coarser granularity than phoneme recognition based on the first phoneme hidden-layer features, a sliding-window operation needs to be applied to the first phoneme hidden-layer features. For example, with a window length B, B frames of first phoneme hidden-layer features are fed into a neural network at a time; after the neural network abstracts them into word-level hidden-layer features, these features are passed to the word-level classification layer to obtain the word-level prediction results. Language identification based on sentence-level hidden-layer features is of coarser granularity still, so the word-level hidden-layer features can be pooled several times by a neural network to generate the sentence-level hidden-layer features, which are then input to the language classification layer to obtain the language prediction results.
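As a rough illustration of the granularity hierarchy described above: the shapes, the window length B, and the use of simple averaging in place of the neural-network abstraction are all assumptions made for this sketch.

```python
import torch

B = 5                                        # assumed window length
frame_feats = torch.randn(1, 100, 64)        # (batch, frames, dim): phoneme level

# Take non-overlapping windows of B frames and average each window
# (a real system would feed each window through a neural network instead).
word_feats = frame_feats.unfold(1, B, B).mean(dim=-1)  # (1, 100 // B, 64)

# Pool the word-level features over the whole utterance into a single
# sentence-level feature vector for language classification.
sent_feat = word_feats.mean(dim=1)           # (1, 64)

print(word_feats.shape, sent_feat.shape)
```

Each level shrinks the time axis: 100 frame-level vectors become 20 word-level vectors and finally one sentence-level vector.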

基于上述任一实施例,图6是本发明提供的音素识别模型训练方法的流程示意图,如图6所示,音素识别模型的训练步骤包括:Based on any of the above-mentioned embodiments, FIG. 6 is a schematic flow chart of a phoneme recognition model training method provided by the present invention. As shown in FIG. 6, the training steps of the phoneme recognition model include:

步骤610、固定第一识别模型的特征提取层的参数;Step 610, fixing the parameters of the feature extraction layer of the first recognition model;

步骤620、将各语种的样本语音输入至第一识别模型的特征提取层,得到第一识别模型的特征提取层输出的第二音素隐层特征;Step 620, input the sample speeches of various languages into the feature extraction layer of the first recognition model, and obtain the second phoneme hidden layer features output by the feature extraction layer of the first recognition model;

步骤630、将第二音素隐层特征输入至当前音素分类层,得到当前音素分类层输出的第二音素预测结果;当前音素分类层基于从第二识别模型中筛选得到的音素节点构建得到;Step 630: Input the second phoneme hidden layer feature into the current phoneme classification layer to obtain the second phoneme prediction result output by the current phoneme classification layer; the current phoneme classification layer is constructed based on the phoneme nodes screened from the second recognition model;

步骤640、基于音素级标签与第二音素预测结果之间的差异,对当前音素分类层进行参数迭代,得到音素识别模型。Step 640: Perform parameter iteration on the current phoneme classification layer based on the difference between the phoneme-level label and the second phoneme prediction result to obtain a phoneme recognition model.
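Steps 610 to 640 can be sketched as follows. This is a hedged sketch under assumed shapes: the GRU extractor, dimensions, and retained-phoneme-node count are illustrative, not the disclosure's actual configuration.

```python
import torch
import torch.nn as nn

# Step 610: fix the (pretrained) feature extraction layer's parameters.
extractor = nn.GRU(40, 64, batch_first=True)   # stands in for the trained extractor
for p in extractor.parameters():
    p.requires_grad = False

num_kept_nodes = 10                            # Nc nodes remaining after pruning
head = nn.Linear(64, num_kept_nodes)           # current phoneme classification layer
opt = torch.optim.Adam(head.parameters(), lr=1e-3)  # only the head is updated

speech = torch.randn(2, 50, 40)
labels = torch.randint(0, num_kept_nodes, (2, 50))

with torch.no_grad():
    hidden, _ = extractor(speech)              # step 620: second phoneme hidden features
logits = head(hidden)                          # step 630: second phoneme prediction
loss = nn.functional.cross_entropy(            # step 640: label/prediction difference
    logits.reshape(-1, num_kept_nodes), labels.reshape(-1))
loss.backward()
opt.step()                                     # iterate only the head's parameters
print(logits.shape)                            # torch.Size([2, 50, 10])
```

Because the extractor is frozen, the model keeps the second recognition model's feature extraction capability while only the much smaller classification layer is trained.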

具体地，第一识别模型的特征提取层即为训练完成的第二识别模型的特征提取层，由于训练完成的第二识别模型具备能够准确进行特征提取的能力，从而在得到第一识别模型，并固定第一识别模型的特征提取层的参数后，第一识别模型能够保留训练完成的第二识别模型的特征提取能力。Specifically, the feature extraction layer of the first recognition model is the feature extraction layer of the trained second recognition model. Since the trained second recognition model is capable of accurate feature extraction, after the first recognition model is obtained and the parameters of its feature extraction layer are fixed, the first recognition model retains the feature extraction capability of the trained second recognition model.

第一识别模型的特征提取层用于提取各语种的样本语音对应的第二音素隐层特征。由于第一识别模型的特征提取层即为训练完成的第二识别模型的特征提取层,从而第一识别模型的特征提取层能够准确提取得到第二音素隐层特征。The feature extraction layer of the first recognition model is used to extract the second phoneme hidden layer features corresponding to the sample speeches of various languages. Since the feature extraction layer of the first recognition model is the feature extraction layer of the trained second recognition model, the feature extraction layer of the first recognition model can accurately extract the features of the second phoneme hidden layer.

当前音素分类层基于从第二识别模型中筛选得到的音素节点构建得到，也就是当前音素分类层是在第二识别模型的音素分类层的基础上滤除了冗余音素节点，即当前音素分类层的音素节点数量小于第二识别模型的音素分类层的音素节点数量。同样地，当前音素分类层用于基于第二音素隐层特征进行音素识别，得到第二音素预测结果。The current phoneme classification layer is constructed based on the phoneme nodes selected from the second recognition model; that is, it is the phoneme classification layer of the second recognition model with the redundant phoneme nodes filtered out, so the number of phoneme nodes in the current phoneme classification layer is smaller than that in the phoneme classification layer of the second recognition model. Likewise, the current phoneme classification layer performs phoneme recognition based on the second phoneme hidden-layer features to obtain the second phoneme prediction result.

在得到第二音素预测结果之后，基于音素级标签与第二音素预测结果之间的差异，对当前音素分类层进行参数迭代，使得音素识别模型在训练过程中能够尽量学习各语种下不同类别音素的信息，从而使得音素识别模型能够准确识别各语种下的音素。After the second phoneme prediction result is obtained, parameter iteration is performed on the current phoneme classification layer based on the difference between the phoneme-level labels and the second phoneme prediction result, so that during training the phoneme recognition model learns as much information as possible about the different categories of phonemes in each language and can therefore accurately recognize the phonemes of each language.

由此可见,本发明实施例基于音素级标签与第二音素预测结果之间的差异,对当前音素分类层进行参数迭代,能够使得训练得到的音素识别模型准确识别各语种下的音素。It can be seen that the embodiment of the present invention performs parameter iteration on the current phoneme classification layer based on the difference between the phoneme-level label and the second phoneme prediction result, so that the trained phoneme recognition model can accurately recognize phonemes in various languages.

作为一种可选实施例，第一识别模型中的特征提取层可以包括第二编码层和第二注意力层，第二编码层用于对各语种的样本语音进行编码，得到各样本语音的第二编码特征，第二注意力层用于基于注意力机制，对各样本语音的第二编码特征进行注意力变换，得到第二音素隐层特征。此外，第一识别模型中的当前音素分类层可以包括第二解码层和第二识别层，第二解码层用于对第二音素隐层特征进行解码，得到各样本语音的第二解码特征，第二识别层用于基于各样本语音的第二解码特征进行音素识别，得到第二音素预测结果。As an optional embodiment, the feature extraction layer of the first recognition model may include a second encoding layer and a second attention layer. The second encoding layer encodes the sample speeches of each language to obtain the second encoding features of each sample speech, and the second attention layer applies an attention mechanism to the second encoding features to obtain the second phoneme hidden-layer features. In addition, the current phoneme classification layer of the first recognition model may include a second decoding layer and a second recognition layer. The second decoding layer decodes the second phoneme hidden-layer features to obtain the second decoding features of each sample speech, and the second recognition layer performs phoneme recognition based on the second decoding features to obtain the second phoneme prediction result.

基于上述任一实施例,音素识别模型基于多个语种的样本语音及各样本语音的音素级标签,对第一识别模型进行训练得到。第一识别模型是基于第二识别模型下各音素节点所对应音素之间的相似度,对第二识别模型下的音素节点进行筛选得到的。本发明还提供一种音素识别模型训练方法,该方法包括:Based on any of the above embodiments, the phoneme recognition model is obtained by training the first recognition model based on sample speeches of multiple languages and phoneme-level labels of each sample speech. The first recognition model is obtained by screening the phoneme nodes under the second recognition model based on the similarity between phonemes corresponding to each phoneme node under the second recognition model. The present invention also provides a phoneme recognition model training method, the method comprising:

图7是本发明提供的再一第二识别模型训练方法的流程示意图，如图7所示，第二识别模型包括特征提取层和多个语种分别对应的音素分类层，各音素分类层基于各语种对应的音素节点构建得到。基于多个语种的样本语音及各样本语音的音素级标签训练得到第二识别模型，具体为：将各语种的样本语音输入至第二识别模型的特征提取层，得到第一音素隐层特征，并将第一音素隐层特征输入至各语种的音素分类层，得到第一音素预测结果，基于音素级标签与第一音素预测结果之间的差异，对第二识别模型的特征提取层和各语种的音素分类层进行参数迭代，得到初始第二识别模型。Fig. 7 is a schematic flowchart of yet another second recognition model training method provided by the present invention. As shown in Fig. 7, the second recognition model includes a feature extraction layer and a phoneme classification layer for each of multiple languages, each phoneme classification layer being constructed based on the phoneme nodes of its language. The second recognition model is trained on the sample speeches of the multiple languages and the phoneme-level labels of each sample speech, specifically: the sample speeches of each language are input into the feature extraction layer of the second recognition model to obtain the first phoneme hidden-layer features; the first phoneme hidden-layer features are input into the phoneme classification layer of each language to obtain the first phoneme prediction results; and parameter iteration is performed on the feature extraction layer of the second recognition model and the phoneme classification layer of each language based on the difference between the phoneme-level labels and the first phoneme prediction results, to obtain an initial second recognition model.

在得到初始第二识别模型后，基于第一音素隐层特征，确定字级隐层特征和句级隐层特征，将字级隐层特征输入至辅助模型的字级分类层，得到字级预测结果，以及将句级隐层特征输入至辅助模型的语种分类层，得到语种预测结果。基于字级标签与字级预测结果之间的差异和语种标签与语种预测结果之间的差异，对初始第二识别模型的特征提取层进行参数迭代，得到第二识别模型。After the initial second recognition model is obtained, the word-level hidden-layer features and the sentence-level hidden-layer features are determined based on the first phoneme hidden-layer features; the word-level hidden-layer features are input into the word-level classification layer of the auxiliary model to obtain the word-level prediction results, and the sentence-level hidden-layer features are input into the language classification layer of the auxiliary model to obtain the language prediction results. Based on the difference between the word-level labels and the word-level prediction results and the difference between the language labels and the language prediction results, parameter iteration is performed on the feature extraction layer of the initial second recognition model to obtain the second recognition model.

接着，基于各音素节点所对应音素之间的相似度，对第二识别模型下的各音素节点进行聚类，得到多个簇类，并保留各簇类中的任意一个音素节点以及删除各簇类中的其它音素节点，得到第一识别模型。Then, based on the similarity between the phonemes corresponding to the phoneme nodes, the phoneme nodes of the second recognition model are clustered to obtain multiple clusters; any one phoneme node in each cluster is retained and the other phoneme nodes in the cluster are deleted, to obtain the first recognition model.

在得到第一识别模型后,固定第一识别模型的特征提取层的参数。将各语种的样本语音输入至第一识别模型的特征提取层,得到第二音素隐层特征,将第二音素隐层特征输入至当前音素分类层,得到第二音素预测结果;其中,当前音素分类层基于从第二识别模型中筛选得到的音素节点构建得到。After obtaining the first recognition model, the parameters of the feature extraction layer of the first recognition model are fixed. Input the sample speech of each language to the feature extraction layer of the first recognition model to obtain the second phoneme hidden layer feature, and input the second phoneme hidden layer feature to the current phoneme classification layer to obtain the second phoneme prediction result; wherein, the current phoneme The classification layer is constructed based on the phoneme nodes screened from the second recognition model.

最后，基于音素级标签与第二音素预测结果之间的差异，对当前音素分类层进行参数迭代，得到音素识别模型，该音素识别模型不仅规模较小，而且能够准确对不同语种的音素进行区分。Finally, based on the difference between the phoneme-level labels and the second phoneme prediction result, parameter iteration is performed on the current phoneme classification layer to obtain the phoneme recognition model, which is not only smaller in scale but can also accurately distinguish the phonemes of different languages.

下面对本发明提供的音素识别装置进行描述,下文描述的音素识别装置与上文描述的音素识别方法可相互对应参照。The phoneme recognition device provided by the present invention is described below, and the phoneme recognition device described below and the phoneme recognition method described above can be referred to in correspondence.

基于上述任一实施例,图8是本发明提供的音素识别装置的结构示意图,如图8所示,该装置包括:Based on any of the above-mentioned embodiments, FIG. 8 is a schematic structural diagram of a phoneme recognition device provided by the present invention. As shown in FIG. 8, the device includes:

确定单元810,用于确定待识别语音;A determining unit 810, configured to determine the speech to be recognized;

识别单元820，用于将所述待识别语音输入至音素识别模型，得到所述音素识别模型输出的音素识别结果；The recognition unit 820 is configured to input the speech to be recognized into the phoneme recognition model to obtain the phoneme recognition result output by the phoneme recognition model;

所述音素识别模型基于多个语种的样本语音及各样本语音的音素级标签，对第一识别模型进行训练得到，所述第一识别模型是基于第二识别模型下各音素节点所对应音素之间的相似度，对所述第二识别模型下的音素节点进行筛选得到的，所述第二识别模型包括多个语种分别对应的音素节点。The phoneme recognition model is obtained by training the first recognition model based on sample speeches of multiple languages and the phoneme-level labels of each sample speech; the first recognition model is obtained by screening the phoneme nodes of the second recognition model based on the similarity between the phonemes corresponding to the phoneme nodes of the second recognition model, and the second recognition model includes phoneme nodes corresponding to each of the multiple languages.

基于上述任一实施例,所述装置还包括:Based on any of the above-mentioned embodiments, the device further includes:

聚类单元,用于基于各音素节点所对应音素之间的相似度,对所述第二识别模型下的各音素节点进行聚类,得到多个簇类;A clustering unit, configured to cluster each phoneme node under the second recognition model based on the similarity between the phonemes corresponding to each phoneme node, to obtain a plurality of clusters;

剪枝单元,用于从各簇类中的音素节点筛选得到当前音素节点,并删除各簇类中除当前音素节点以外的其它音素节点,得到所述第一识别模型。The pruning unit is configured to filter phoneme nodes in each cluster to obtain the current phoneme node, and delete other phoneme nodes in each cluster except the current phoneme node to obtain the first recognition model.

Based on any of the above embodiments, the second recognition model includes a feature extraction layer and phoneme classification layers respectively corresponding to the multiple languages, each phoneme classification layer being constructed based on the phoneme nodes corresponding to its language.

The device further includes:

a first feature extraction unit, configured to input the sample speeches of the respective languages into the feature extraction layer of the second recognition model, to obtain first phoneme hidden layer features output by the feature extraction layer of the second recognition model;

a first phoneme classification unit, configured to input the first phoneme hidden layer features into the phoneme classification layer of each language, to obtain first phoneme prediction results output by the phoneme classification layers of the respective languages;

a first parameter iteration unit, configured to perform parameter iteration on the feature extraction layer of the second recognition model and the phoneme classification layers of the respective languages based on the difference between the phoneme-level labels and the first phoneme prediction results, to obtain the second recognition model.
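The structure these three units operate on, a shared feature extraction layer feeding one phoneme classification layer (head) per language, can be sketched minimally as below. The linear-plus-tanh extractor, the layer sizes, and the phoneme inventory sizes are all illustrative assumptions; a real system would use a deep acoustic encoder trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

class SecondRecognitionModel:
    """Shared feature extraction layer plus one phoneme classification
    head per language -- a minimal linear sketch."""
    def __init__(self, feat_dim, hidden_dim, phones_per_lang):
        self.W_feat = rng.normal(size=(feat_dim, hidden_dim)) * 0.1
        self.heads = {lang: rng.normal(size=(hidden_dim, n)) * 0.1
                      for lang, n in phones_per_lang.items()}

    def extract(self, frames):                # frames: (T, feat_dim)
        return np.tanh(frames @ self.W_feat)  # first phoneme hidden features

    def classify(self, frames, lang):
        h = self.extract(frames)
        logits = h @ self.heads[lang]
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)  # per-frame phoneme posteriors

model = SecondRecognitionModel(feat_dim=40, hidden_dim=16,
                               phones_per_lang={"zh": 60, "en": 40})
probs = model.classify(rng.normal(size=(5, 40)), "zh")
```

The per-language heads share the same hidden features, which is what makes the later cross-language similarity comparison between phoneme nodes meaningful.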

Based on any of the above embodiments, the device further includes:

a feature determination unit, configured to, after the first phoneme hidden layer features output by the feature extraction layer of the second recognition model are obtained, determine word-level hidden layer features and/or sentence-level hidden layer features based on the first phoneme hidden layer features;

a second parameter iteration unit, configured to perform parameter iteration on the feature extraction layer of the second recognition model based on the difference between the word-level labels of the sample speeches and the word-level prediction results and/or the difference between the language labels of the sample speeches and the language prediction results, to obtain the second recognition model; where the word-level prediction results are determined based on the word-level hidden layer features, and the language prediction results are determined based on the sentence-level hidden layer features.

Based on any of the above embodiments, the second parameter iteration unit includes:

an auxiliary prediction unit, configured to input the word-level hidden layer features into a word-level classification layer to obtain the word-level prediction results output by the word-level classification layer, and/or input the sentence-level hidden layer features into a language classification layer to obtain the language prediction results output by the language classification layer;

an auxiliary training unit, configured to perform parameter iteration on the feature extraction layer of the second recognition model based on the difference between the word-level labels and the word-level prediction results and/or the difference between the language labels and the language prediction results, to obtain the second recognition model.
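The auxiliary training described here amounts to adding weighted word-level and sentence-level (language) cross-entropy terms to the frame-level phoneme loss. A hedged sketch follows, in which the loss weights `w_word` and `w_lang` are illustrative values not specified by the patent:

```python
import numpy as np

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class under a probability vector."""
    return -np.log(probs[label])

def multitask_loss(phone_probs, phone_labels,
                   word_probs=None, word_labels=None,
                   lang_probs=None, lang_label=None,
                   w_word=0.3, w_lang=0.1):
    """Phoneme loss plus optional word-level and language-level auxiliary
    losses; either auxiliary term may be omitted, matching the 'and/or'
    wording of the embodiment."""
    loss = np.mean([cross_entropy(p, y)
                    for p, y in zip(phone_probs, phone_labels)])
    if word_probs is not None:
        loss += w_word * np.mean([cross_entropy(p, y)
                                  for p, y in zip(word_probs, word_labels)])
    if lang_probs is not None:
        loss += w_lang * cross_entropy(lang_probs, lang_label)
    return loss

phone_probs = [np.array([0.7, 0.3]), np.array([0.2, 0.8])]
base = multitask_loss(phone_probs, [0, 1])
full = multitask_loss(phone_probs, [0, 1],
                      lang_probs=np.array([0.5, 0.5]), lang_label=0)
```

Minimizing the combined loss pushes the shared feature extraction layer to encode word identity and language identity alongside phoneme identity.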

Based on any of the above embodiments, the feature determination unit includes:

a sliding-window unit, configured to apply a sliding window to the first phoneme hidden layer features to obtain the word-level hidden layer features;

a pooling unit, configured to pool the word-level hidden layer features to obtain the sentence-level hidden layer features.
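These two units could be sketched as a fixed-size sliding window over the frame-level phoneme hidden features, followed by pooling over the resulting word-level features. The window size, stride, and the choice of mean pooling (rather than max or attention pooling) are illustrative assumptions:

```python
import numpy as np

def word_level_features(phone_hidden, win=5, stride=2):
    """Slide a fixed window over frame-level phoneme hidden features and
    average each window into one word-level feature vector."""
    T = phone_hidden.shape[0]
    starts = range(0, max(T - win, 0) + 1, stride)
    return np.stack([phone_hidden[s:s + win].mean(axis=0) for s in starts])

def sentence_level_feature(word_feats):
    """Pool the word-level features into a single sentence-level vector."""
    return word_feats.mean(axis=0)

h = np.arange(20, dtype=float).reshape(10, 2)   # 10 frames, hidden dim 2
words = word_level_features(h, win=4, stride=3)  # 3 word-level vectors
sent = sentence_level_feature(words)             # 1 sentence-level vector
```

The word-level features would feed the word-level classification layer, and the pooled sentence-level feature would feed the language classification layer of the preceding embodiment.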

Based on any of the above embodiments, the device further includes:

a parameter fixing unit, configured to fix the parameters of the feature extraction layer of the first recognition model;

a second feature extraction unit, configured to input the sample speeches of the respective languages into the feature extraction layer of the first recognition model, to obtain second phoneme hidden layer features output by the feature extraction layer of the first recognition model;

a second phoneme classification unit, configured to input the second phoneme hidden layer features into a current phoneme classification layer, to obtain second phoneme prediction results output by the current phoneme classification layer; where the current phoneme classification layer is constructed based on the phoneme nodes screened from the second recognition model;

a third parameter iteration unit, configured to perform parameter iteration on the current phoneme classification layer based on the difference between the phoneme-level labels and the second phoneme prediction results, to obtain the phoneme recognition model.
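The fine-tuning stage these units describe, freezing the feature extraction layer and iterating only the parameters of the new (pruned) phoneme classification layer, could be sketched as softmax regression on frozen features. The toy extractor, learning rate, and step count below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def train_current_head(extract, frames, labels, n_phones, lr=0.5, steps=200):
    """Train only the current phoneme classification layer on top of a
    frozen feature extraction layer, via softmax cross-entropy."""
    H = extract(frames)                   # frozen: computed once, never updated
    W = np.zeros((H.shape[1], n_phones))  # current phoneme classification layer
    onehot = np.eye(n_phones)[labels]
    for _ in range(steps):
        logits = H @ W
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs = e / e.sum(axis=1, keepdims=True)
        W -= lr * H.T @ (probs - onehot) / len(labels)  # update the head only
    return W

def extract(frames):                      # stand-in for the frozen extractor
    return np.tanh(frames)

frames = rng.normal(size=(64, 8))
labels = (frames[:, 0] > 0).astype(int)   # toy two-phoneme labels
W = train_current_head(extract, frames, labels, n_phones=2)
preds = (extract(frames) @ W).argmax(axis=1)
```

Because only the head is updated, the multilingual acoustic knowledge captured by the feature extraction layer during the earlier training stage is preserved.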

FIG. 9 is a schematic structural diagram of an electronic device provided by the present invention. As shown in FIG. 9, the electronic device may include: a processor 910, a memory 920, a communications interface 930, and a communication bus 940, where the processor 910, the memory 920, and the communications interface 930 communicate with one another through the communication bus 940. The processor 910 may invoke logic instructions in the memory 920 to execute the phoneme recognition method, the method including: determining a speech to be recognized; inputting the speech to be recognized into a phoneme recognition model to obtain a phoneme recognition result output by the phoneme recognition model; where the phoneme recognition model is obtained by training a first recognition model based on sample speeches of multiple languages and phoneme-level labels of the sample speeches, the first recognition model is obtained by screening the phoneme nodes under a second recognition model based on the similarity between the phonemes corresponding to those phoneme nodes, and the second recognition model includes phoneme nodes respectively corresponding to the multiple languages.

In addition, the above logic instructions in the memory 920 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part thereof contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

In another aspect, the present invention further provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions. When the program instructions are executed by a computer, the computer is able to execute the phoneme recognition method provided above, the method including: determining a speech to be recognized; inputting the speech to be recognized into a phoneme recognition model to obtain a phoneme recognition result output by the phoneme recognition model; where the phoneme recognition model is obtained by training a first recognition model based on sample speeches of multiple languages and phoneme-level labels of the sample speeches, the first recognition model is obtained by screening the phoneme nodes under a second recognition model based on the similarity between the phonemes corresponding to those phoneme nodes, and the second recognition model includes phoneme nodes respectively corresponding to the multiple languages.

In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. The computer program, when executed by a processor, implements the phoneme recognition method provided above, the method including: determining a speech to be recognized; inputting the speech to be recognized into a phoneme recognition model to obtain a phoneme recognition result output by the phoneme recognition model; where the phoneme recognition model is obtained by training a first recognition model based on sample speeches of multiple languages and phoneme-level labels of the sample speeches, the first recognition model is obtained by screening the phoneme nodes under a second recognition model based on the similarity between the phonemes corresponding to those phoneme nodes, and the second recognition model includes phoneme nodes respectively corresponding to the multiple languages.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.

From the above description of the implementations, those skilled in the art can clearly understand that each implementation may be realized by means of software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solution, in essence, or the part thereof contributing to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements for some of the technical features therein; and these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A phoneme recognition method, comprising:
determining a speech to be recognized;
inputting the speech to be recognized into a phoneme recognition model to obtain a phoneme recognition result output by the phoneme recognition model;
the phoneme recognition model is obtained by training a first recognition model based on sample voices of multiple languages and phoneme level labels of the sample voices, the first recognition model is obtained by screening phoneme nodes under a second recognition model based on similarity between phonemes corresponding to the phoneme nodes under the second recognition model, and the second recognition model comprises phoneme nodes corresponding to the multiple languages respectively.
2. The phoneme recognition method of claim 1, wherein the step of determining the first recognition model comprises:
clustering the phoneme nodes under the second recognition model based on the similarity between the phonemes corresponding to the phoneme nodes, to obtain a plurality of clusters;
and screening the phoneme nodes in each cluster to obtain a current phoneme node, and deleting the phoneme nodes in each cluster other than the current phoneme node, to obtain the first recognition model.
3. The phoneme recognition method of claim 1, wherein the second recognition model comprises a feature extraction layer and phoneme classification layers corresponding to a plurality of languages, and each phoneme classification layer is constructed based on phoneme nodes corresponding to each language;
the second recognition model is obtained by training based on the following steps:
inputting sample voices of various languages into the feature extraction layer of the second recognition model to obtain first phoneme hidden layer features output by the feature extraction layer of the second recognition model;
inputting the first phoneme hidden layer characteristics to the phoneme classification layers of all languages to obtain first phoneme prediction results output by the phoneme classification layers of all languages;
and performing parameter iteration on a feature extraction layer of the second recognition model and a phoneme classification layer of each language based on the difference between the phoneme level label and the first phoneme prediction result to obtain the second recognition model.
4. The phoneme recognition method of claim 3, wherein after the first phoneme hidden layer features output by the feature extraction layer of the second recognition model are obtained, the method further comprises:
determining word-level hidden layer features and/or sentence-level hidden layer features based on the first phoneme hidden layer features;
performing parameter iteration on a feature extraction layer of the second recognition model based on the difference between the word-level label and the word-level prediction result of the sample voice and/or the difference between the language label and the language prediction result of the sample voice to obtain the second recognition model; the word-level prediction result is determined based on the word-level hidden layer characteristics, and the language prediction result is determined based on the sentence-level hidden layer characteristics.
5. The phoneme recognition method of claim 4, wherein the performing parameter iteration on the feature extraction layer of the second recognition model based on the difference between the word-level tag of the sample speech and the word-level prediction result and/or the difference between the language tag of the sample speech and the language prediction result to obtain the second recognition model comprises:
inputting the word-level hidden layer features to a word-level classification layer to obtain the word-level prediction result output by the word-level classification layer, and/or inputting the sentence-level hidden layer features to a language classification layer to obtain the language prediction result output by the language classification layer;
and performing parameter iteration on the feature extraction layer of the second recognition model based on the difference between the word-level label and the word-level prediction result and/or the difference between the language label and the language prediction result to obtain the second recognition model.
6. The phoneme recognition method of claim 4, wherein said determining word-level hidden layer features and/or sentence-level hidden layer features based on said first phoneme hidden layer features comprises:
applying a sliding window to the first phoneme hidden layer features to obtain the word-level hidden layer features;
pooling the word-level hidden layer features to obtain the sentence-level hidden layer features.
7. The phoneme recognition method of claim 3, wherein the phoneme recognition model is trained based on the following steps:
fixing parameters of a feature extraction layer of the first recognition model;
inputting sample voices of various languages into the feature extraction layer of the first recognition model to obtain second phoneme hidden layer features output by the feature extraction layer of the first recognition model;
inputting the second phoneme hidden layer characteristics to a current phoneme classification layer to obtain a second phoneme prediction result output by the current phoneme classification layer; the current phoneme classification layer is constructed on the basis of phoneme nodes obtained by screening from the second recognition model;
and performing parameter iteration on the current phoneme classification layer based on the difference between the phoneme level label and the second phoneme prediction result to obtain the phoneme recognition model.
8. A phoneme recognition apparatus, comprising:
a determining unit, configured to determine a speech to be recognized;
a recognition unit, configured to input the speech to be recognized into a phoneme recognition model to obtain a phoneme recognition result output by the phoneme recognition model;
the phoneme recognition model is obtained by training a first recognition model based on sample voices of multiple languages and phoneme level labels of the sample voices, the first recognition model is obtained by screening phoneme nodes under a second recognition model based on similarity between phonemes corresponding to the phoneme nodes under the second recognition model, and the second recognition model comprises phoneme nodes corresponding to the multiple languages respectively.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the phoneme recognition method of any one of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the phoneme recognition method of any one of claims 1 to 7.
CN202210855299.3A 2022-07-19 2022-07-19 Phoneme recognition method, apparatus, electronic device and storage medium Active CN115359783B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210855299.3A CN115359783B (en) 2022-07-19 2022-07-19 Phoneme recognition method, apparatus, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN115359783A true CN115359783A (en) 2022-11-18
CN115359783B CN115359783B (en) 2025-07-22

Family

ID=84031708

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5075896A (en) * 1989-10-25 1991-12-24 Xerox Corporation Character and phoneme recognition based on probability clustering
EP1418570A1 (en) * 2002-11-06 2004-05-12 Swisscom Fixnet AG Cross-lingual speech recognition method
CN109817213A (en) * 2019-03-11 2019-05-28 腾讯科技(深圳)有限公司 The method, device and equipment of speech recognition is carried out for adaptive languages
CN110895932A (en) * 2018-08-24 2020-03-20 中国科学院声学研究所 Multilingual Speech Recognition Method Based on Cooperative Classification of Language Type and Speech Content
CN111833845A (en) * 2020-07-31 2020-10-27 平安科技(深圳)有限公司 Multi-language speech recognition model training method, device, equipment and storage medium
CN113724700A (en) * 2021-10-20 2021-11-30 合肥讯飞数码科技有限公司 Language identification and language identification model training method and device
CN114360584A (en) * 2021-12-02 2022-04-15 华南理工大学 Phoneme-level-based speech emotion layered recognition method and system
CN114420101A (en) * 2022-03-31 2022-04-29 成都启英泰伦科技有限公司 Unknown language end-side command word small data learning and identifying method
WO2022105235A1 (en) * 2020-11-18 2022-05-27 华为技术有限公司 Information recognition method and apparatus, and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
C. NIEUWOUDT, E.C. BOTHA: "Cross-language use of acoustic information for automatic speech recognition", SPEECH COMMUNICATION, vol. 37, 1 September 2002 (2002-09-01), pages 101 - 113 *
JIN Ma; SONG Yan; DAI Lirong: "A Language Identification System Based on Convolutional Neural Networks", Journal of Data Acquisition and Processing, no. 02, 15 March 2019 (2019-03-15) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant