
CN111063335B - End-to-end tone recognition method based on neural network - Google Patents

End-to-end tone recognition method based on neural network

Info

Publication number
CN111063335B
CN111063335B (application CN201911310349.4A)
Authority
CN
China
Prior art keywords
tone
syllable
network
neural network
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201911310349.4A
Other languages
Chinese (zh)
Other versions
CN111063335A (en)
Inventor
黄浩 (Huang Hao)
王凯 (Wang Kai)
胡英 (Hu Ying)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinjiang University
Original Assignee
Xinjiang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinjiang University filed Critical Xinjiang University
Priority to CN201911310349.4A
Publication of CN111063335A
Application granted
Publication of CN111063335B
Status: Expired - Fee Related
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an end-to-end tone recognition method based on a neural network. The method comprises: constructing an end-to-end tone recognition model and determining the required hyperparameters, such as the number of neural network layers and the number of hidden-layer nodes; training a speech recognition acoustic model on the training set and obtaining the start and end times of each syllable by forced alignment; feeding the selected training speech data and the tone label of each syllable into the end-to-end tone recognition model for training and optimization to obtain optimized neural network model parameters; continuously adjusting the neural network model parameters and selecting the optimal ones; obtaining test speech and, when the sentence text is given, using forced alignment to obtain the start and end times of each syllable, or, when it is not given, using automatic speech recognition to obtain them; and feeding the selected test speech data together with the syllable time marks into the end-to-end tone recognition model for recognition, finally obtaining the tone type of each syllable in each test utterance.

Description

An End-to-End Tone Recognition Method Based on Neural Networks

Technical Field

The invention relates to the field of Mandarin Chinese tone recognition, and in particular to an end-to-end tone recognition method based on a neural network.

Background Art

With the rapid development of artificial intelligence, research on speech technology continues to deepen across fields such as speech recognition, speech synthesis, speech separation, voice conversion, and speaker recognition. Experiments in these fields show that pitch has a large influence on results for tonal languages. Mandarin Chinese distinguishes five tone categories: yinping (first tone, level, marked —), yangping (second tone, rising, /), shangsheng (third tone, dipping, ∨), qusheng (fourth tone, falling, \), and the neutral tone. Tone is an essential part of Mandarin: a wrong tone creates ambiguity and leads to errors in speech understanding, so the study of Mandarin tones is necessary. Tone recognition is an important research direction in the speech field; its main purpose is to obtain the tones of tonal-language speech accurately and thereby improve the accuracy of tasks such as speech recognition and speech synthesis. Traditional tone recognition adopts the classical classification pipeline of front-end feature extraction and a back-end classifier; that is, it consists of two independent stages, the extraction of fundamental frequency (F0) features and the classification of tones.

F0 features can be extracted by time-domain analysis, frequency-domain analysis, or hybrid methods. Time-domain methods include the autocorrelation method and the average magnitude difference method; frequency-domain methods include the cepstrum method. All of these are hand-designed heuristic F0 extraction algorithms whose settings are tuned empirically based on experimental phonetics. For the back-end tone classification model, classifiers from traditional pattern recognition are mainly used, such as support vector machines, Gaussian mixture models, decision trees, Gaussian mixture model-hidden Markov models, conditional random fields, or neural networks.

These tone classification models fall into two categories: models based on frame-level features and models based on segment-level features. Frame-based models include the Gaussian mixture model-hidden Markov model and the conditional random field; segment-based models include the support vector machine, the Gaussian mixture model, and the decision tree. Frame-based models can process the extracted F0-related features directly, taking a variable-length sequence as input and computing the posterior probability of the tone model given the input sequence. Segment-based methods can only handle input feature vectors of fixed dimension: the F0-related feature sequence must first be extracted and then manually converted into a fixed-dimension observation vector containing the tone information, which is fed to the segment-feature classifier to train the tone model; finally, the test data are classified according to the tone model to obtain the recognition result, completing the tone recognition pipeline.

Current tone classification technology has the following two main problems:

First, traditional F0 extraction methods are imperfect, and the extracted F0 values are not accurate enough. This inaccuracy makes the subsequent tone classification results inaccurate as well.

Second, when classifying tones using only F0 values obtained by traditional extraction methods, the hand-designed F0 features cannot fully capture the information that is helpful for tone classification, so the final classification result is not necessarily optimal. Moreover, the traditional method is divided into two stages, feature extraction and classifier training, each of which requires extensive parameter tuning. Together these factors mean that two-stage tone recognition does not necessarily achieve the overall optimum.

Summary of the Invention

The invention provides an end-to-end tone recognition method based on a neural network. The traditionally separate front-end F0 feature extraction architecture and back-end tone classification architecture are learned jointly, performing end-to-end tone recognition and achieving accurate tone classification, as described below:

An end-to-end tone recognition method based on a neural network, the method comprising:

1. Training the tone recognition system model:

constructing an end-to-end tone recognition model and determining the required hyperparameters, such as the number of neural network layers and the number of hidden-layer nodes;

training a speech recognition acoustic model on the training set and obtaining the start and end times of each syllable by forced alignment;

feeding the selected training speech data and the tone label of each syllable into the end-to-end tone recognition model for training and optimization to obtain optimized neural network model parameters;

continuously adjusting the neural network model parameters and selecting the optimal ones;

2. Tone recognition:

obtaining test speech and, when the sentence text is given, using forced alignment to obtain the start and end times of each syllable; when it is not given, using automatic speech recognition to obtain them;

feeding the selected test speech data into the end-to-end tone recognition model for recognition, finally obtaining the tone type of each syllable in each test utterance.

The method constructs a trainable deep neural network model and combines the F0 extraction neural network with the tone decoding neural network to form an end-to-end neural network tone classification model; the parameters of the two sub-networks are trained and tuned jointly during the training phase.

The F0 extraction neural network is a recurrent-neural-network-based encoder-decoder, divided into an F0 encoder network and an F0 decoder network.

Further, the F0 encoder network encodes the speech with a recurrent neural network; the F0 decoder network predicts F0 labels starting from the last frame of the speech, converts each predicted label into a trainable F0 embedding vector, and uses the embedding of the following frame's label to predict the label of the preceding frame, until the F0 label of the first frame has been predicted;

after the label of every frame has been predicted, the predefined correspondence between labels and frequencies is used to convert the labels into the F0 value sequence of the whole utterance.

The tone decoding neural network is divided into two parts: a tone representation network and a label-dependent tone classification network;

the tone representation network maps the predicted F0 sequence of each syllable into a vector of fixed dimension;

the tone classification network predicts the tone type of the current syllable from the label predicted for the previous syllable and the fixed-dimension vector of the current syllable.

Further, the prediction of the current syllable's tone type from the previous syllable's predicted label and the current syllable's fixed-dimension vector proceeds as follows:

first, the fixed-dimension representation of the 1st syllable is concatenated with the fixed-dimension embedding corresponding to the beginning of the sentence and fed into the tone classification network to predict the tone type of the 1st syllable;

the predicted tone type of the 1st syllable is converted into the corresponding tone label, which is fed together with the fixed-dimension representation of the 2nd syllable into the tone classification network as joint input to obtain the tone type of the 2nd syllable;

this cycle repeats until the tone of the last syllable has been predicted.

The technical scheme provided by the invention has the following beneficial effects:

1. Compared with the traditional tone recognition method with two independent stages (F0 feature extraction and tone classification), the end-to-end joint method of the invention reduces the shortcomings of hand-designed algorithms, so that better tone classification results can be obtained;

2. The invention improves the accuracy of Mandarin tone recognition. It breaks the traditional framework in which tone recognition is divided into two stages (F0 feature extraction and tone classification) and constructs an end-to-end tone model. The end-to-end model jointly trains the two networks used by the two traditional stages, namely the feature extraction network and the tone classification network, as a single network. This bypasses the hand-design step, and the network parameters of the entire model can be jointly optimized, improving the accuracy of Mandarin Chinese tone recognition; the approach is applicable to tone problems in tonal languages generally;

3. For the feature extraction network, the invention uses an encoder-decoder F0 extraction network, in which a recurrent-neural-network encoder maps the whole input sequence to a fixed-dimension feature vector, from which the F0 label of each frame is inferred in reverse time order;

4. For the tone recognition network, a label-dependent tone decoding network is adopted. When predicting the current tone label, the decoding network uses not only the features the F0 extraction network produces for the current syllable but also the tone label predicted for the previous syllable, so that tone recognition takes the interaction between contextual tone types into account and achieves better results.

Brief Description of the Drawings

Figure 1 is the overall framework diagram of the end-to-end tone recognition network;

Figure 2 is a diagram of the F0 feature representation network;

Figure 3 is a diagram of the tone embedding representation network;

Figure 4 is a diagram of the tone prediction network.

Detailed Description of Embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are described in further detail below.

The invention identifies the tone category of each syllable pronounced in isolated or continuous Mandarin speech. It proposes an end-to-end tone recognition method that combines a recurrent-neural-network-based encoder-decoder F0 feature extraction network with a tone-label-dependent recognition network, merging the traditional architecture of first extracting pitch (Pitch or F0) features and then classifying tones into a single unified network model, thereby achieving end-to-end tone recognition without explicit F0 feature extraction.

The purpose of tone recognition is to identify the tones in speech and obtain the pitch information they carry, so that this information can meet the requirements of tasks such as speech recognition, speech synthesis, and voice conversion and help those tasks be carried out more accurately. For non-native learners of a tonal second language, for example foreigners learning Chinese, tone recognition also helps correct mistakes and improves learning efficiency.

An encoder-decoder network structure is proposed for the F0 feature extraction network to improve the accuracy of F0 extraction; a tone-label-dependent decoding network is adopted for the tone recognition network; finally, the two networks are combined into a whole whose parameters are jointly optimized. The method makes tone recognition more accurate and more efficient, and performs well both in scientific research and in everyday applications.

Embodiment 1

To solve the above problems, the invention adopts an end-to-end tone recognition method based on a neural network, which converts the traditional two-stage classification problem of first extracting F0-related features and then classifying tones into a single-stage network model whose parameters are tuned jointly, thereby achieving end-to-end tone recognition.

End-to-end methods have become a research hotspot of current artificial intelligence technology, for example end-to-end speech recognition, end-to-end speech synthesis, and end-to-end voice conversion; end-to-end techniques reduce the manual setting of experimental hyperparameters and obtain better performance.

The invention overcomes the shortcoming of the traditional two-stage tone recognition method in which the F0 extraction stage and the tone classification stage are designed separately. The hand-crafted heuristic F0 extraction algorithm is replaced with a data-driven trainable deep neural network model, which is then combined with the tone decoding neural network to form an end-to-end neural network tone classification model.

The invention divides the end-to-end tone recognition deep neural network model into two sub-networks: the F0 extraction network and the tone decoding network. For F0 extraction, a recurrent-neural-network-based encoder-decoder model is proposed, consisting of an F0 encoder network and an F0 decoder network. The F0 encoder network is a recurrent neural network: given an input sequence, it maps the sequence to a sequence of hidden states of the same length. The F0 decoder network is a feedforward neural network: to predict the F0 label of the current frame, it takes the F0 label of the following frame together with the encoder hidden state of the current frame as the joint input of the decoder.

After the F0 labels of the whole utterance are obtained, they are converted into a sequence of continuous F0 values through the mapping between labels and F0 values. This mapping can be a fixed nonlinear mapping from F0 labels to F0 values, or a method called F0 embedding can be used, in which a trainable F0 pool converts each predicted label into a real-valued F0 representation; the F0 values output by the prediction network are then sent to the tone decoder for decoding.

The tone decoder can be a conventional deep feedforward neural network, recurrent neural network, or convolutional neural network. It predicts the tone of each syllable in an utterance from the output of the F0 extraction network. The invention proposes a tone-label-dependent decoding network composed of two parts: a tone representation network and a label-dependent tone classification network. The tone representation network maps the variable-length F0 sequence corresponding to each syllable into a fixed-dimension vector. The label-dependent tone classification network predicts the current tone type from the fixed-dimension representation of the current syllable and the label of the previous syllable.

Specifically, the tone classification network predicts the label of each syllable in order: first, the fixed-dimension representation of the 1st syllable is concatenated with the fixed-dimension embedding corresponding to a BOS (beginning of sentence) token and fed into the tone classification network to predict the tone type of the 1st syllable; the predicted tone type of the 1st syllable is then converted into the corresponding tone label and fed, together with the fixed-dimension representation of the 2nd syllable, into the tone classification network as joint input to obtain the tone type of the 2nd syllable; and so on, until the tone of the last syllable has been predicted.

For Mandarin Chinese tone recognition, the invention improves the ability to classify the tone of each syllable in speech automatically from the raw audio.

The invention obtains the pitch information contained in speech so that it satisfies the requirements of tasks such as speech recognition, speech synthesis, and voice conversion and makes them more accurate. For non-native learners of a tonal second language, such as foreigners learning Chinese, tone recognition helps correct pronunciation errors and improves second-language learning. The method makes tone recognition more accurate than traditional methods and performs well in scientific research and everyday applications.

The described technique overcomes the deficiencies of the traditional two-stage tone classification method: a data-driven neural network replaces the traditional F0 extraction algorithm, the F0 extraction network is merged with the back-end tone recognition network into a unified tone recognition network, and the parameters of the whole network are tuned jointly on the training data, yielding better tone recognition results. An encoder-decoder extraction network is designed for F0 extraction, which extracts F0 features better; a context-dependent tone prediction network is designed for tone decoding, which performs tone recognition better.

Embodiment 2

The scheme of Embodiment 1 is further described below with reference to specific examples:

1. Training the tone recognition system model:

Step 1: select a certain amount of Mandarin speech data as training data (also called samples) for the tone model;

Step 2: build and train an acoustic model for automatic speech recognition and, given the sentence text, use forced alignment to obtain the start and end times of each syllable;

Step 3: build the end-to-end tone recognition model and determine the required hyperparameters, such as the number of neural network layers and the number of hidden-layer nodes;

Step 4: preprocess the data as required by the input chosen for the end-to-end tone recognition model;

Step 5: feed the selected training speech data into the constructed end-to-end tone recognition model for training and optimization to obtain optimized neural network model parameters; the training speed depends on the machine configuration and the scale of the training data;

Step 6: continuously adjust the neural network parameters of the tone classification model while observing the training results, select the optimal network parameters, and save the trained parameters of the tone classification model.

2. Tone recognition:

Step 1: obtain the test speech; given the sentence text, use forced alignment to obtain the start and end times of each syllable; without the sentence text, use automatic speech recognition to obtain them;

Step 2: preprocess the data as required by the input chosen for the end-to-end tone recognition model;

Step 3: feed the selected test speech data into the end-to-end tone recognition model with trained parameters for recognition, finally obtaining the tone type of each syllable in each test utterance.

Embodiment 3

The schemes of Embodiments 1 and 2 are further described below with reference to specific examples and calculation formulas:

(1) Training the speech recognition acoustic model

Before the tone modeling task, speech data are used to train a speech recognition acoustic model based on a Gaussian mixture model-hidden Markov model (GMM-HMM) or a deep neural network-hidden Markov model (DNN-HMM). On the training data, where the pronunciation transcription of each utterance is given, the acoustic model and the forced-alignment technique of speech recognition are used to obtain the start and end times of each syllable of the input speech. On the test data, when the transcription of an utterance is not given, speech recognition decoding is used to obtain the syllable sequence of the input speech together with the start and end times of each syllable. The phoneme boundary information of the aligned phoneme segments serves as the boundary basis for tone recognition and classification.
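As a concrete illustration, the following minimal sketch reads per-syllable time marks from a forced-alignment output file; the CTM-like format and its field order are assumptions for illustration only, since the patent specifies just that alignment yields each syllable's start and end time.

```python
# Hypothetical CTM-like alignment file, one line per syllable:
# "<utterance-id> <channel> <start-sec> <duration-sec> <syllable>"
def read_syllable_times(path: str):
    """Return (utterance, syllable, start_sec, end_sec) tuples."""
    spans = []
    with open(path) as fh:
        for line in fh:
            utt, _chan, start, dur, syl = line.split()
            spans.append((utt, syl, float(start), float(start) + float(dur)))
    return spans
```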

(2) Preprocessing network

The input is the raw speech waveform. The F0 feature extraction network can take raw waveform samples as input, in which case 1024 speech samples are used per frame. The input can also be the normalized cross-correlation function (NCCF) coefficients computed from the waveform.

The coefficients are computed as follows: at frame $f$, a window sequence $w_f$ is taken from the speech signal and window-normalized; a subsequence $v_{f,l}$ of length $n$ is then taken from $w_f$, where $l$ is the time-lag index denoting the offset of $v_{f,l}$ within $w_f$. Different normalized cross-correlation coefficients are computed for different lag indices $l$.

The normalized cross-correlation function coefficients are calculated with the following formula:

$$\Phi_{f,l} = \frac{v_{f,0}^{\top} v_{f,l}}{\sqrt{\lVert v_{f,0}\rVert^{2}\,\lVert v_{f,l}\rVert^{2} + A}} \qquad (1)$$

where $v_{f,0}$ denotes $v_{f,l}$ at lag index $l = 0$, and $A$ is a penalty factor set by hand from experience.
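A minimal NumPy sketch of this NCCF computation, following the reconstructed formula (1) above; the frame length, subsequence length, lag range, and penalty value are illustrative assumptions, not the patent's settings.

```python
import numpy as np

def nccf(w_f: np.ndarray, n: int, max_lag: int, A: float = 1e-4) -> np.ndarray:
    """Normalized cross-correlation coefficients for one frame w_f."""
    v0 = w_f[:n]                              # reference subsequence at lag 0
    coeffs = np.zeros(max_lag + 1)
    for l in range(max_lag + 1):              # lag l = offset of v_{f,l} in w_f
        vl = w_f[l:l + n]
        # penalty A keeps the denominator nonzero on silent frames
        denom = np.sqrt(np.dot(v0, v0) * np.dot(vl, vl) + A)
        coeffs[l] = np.dot(v0, vl) / denom
    return coeffs

frame = np.random.randn(1024)                 # e.g. 1024 raw waveform samples
print(nccf(frame, n=400, max_lag=200).shape)  # (201,)
```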

(3) Training the F0 feature extraction network

The pitch is first obtained by manual labeling or by a traditional pitch extraction algorithm, and the N F0 states corresponding to the F0 values are used as the training labels of the F0 feature extraction network. Given the network inputs from frame 1 to frame F, the encoder RNN of the F0 network performs the forward computation and produces the hidden-layer outputs $(h_1, h_2, \dots, h_F)$:

$$h_f = \sigma\!\left( W x_f + V h_{f-1} + b \right) \qquad (2)$$

where $\sigma$ is the sigmoid function; $W$ is the transformation matrix applied to $x_f$; $x_f$ is the input of the recurrent network at the current frame $f$; $V$ is the transformation matrix applied to $h_{f-1}$; $h_{f-1}$ is the hidden-layer output of frame $f-1$; and $b$ is the bias vector.
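A NumPy sketch of the encoder recurrence in equation (2); the dimensions and random parameter initializations are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

F, d_in, d_h = 100, 257, 128            # frames, input dim, hidden dim (assumed)
rng = np.random.default_rng(0)
W = rng.standard_normal((d_h, d_in)) * 0.01
V = rng.standard_normal((d_h, d_h)) * 0.01
b = np.zeros(d_h)

x = rng.standard_normal((F, d_in))      # e.g. per-frame NCCF coefficients
h = np.zeros(d_h)
states = []
for f in range(F):                      # forward pass over the utterance
    h = sigmoid(W @ x[f] + V @ h + b)   # h_f = sigmoid(W x_f + V h_{f-1} + b)
    states.append(h)
states = np.stack(states)               # (F, d_h): the sequence (h_1, ..., h_F)
```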

In the decoding stage, the F0 extraction network contains a module called the F0 embedding pool. The embedding pool maps a predicted F0 label $\hat{y}_{f+1}$ to an F0 representation vector $e(\hat{y}_{f+1})$. The F0 label predicted at frame $f+1$ is converted into its corresponding F0 embedding through the pool, concatenated with the hidden-layer output $h_f$ of the recurrent network at the current frame, and passed through a softmax layer to obtain the F0 output label of frame $f$:

$$\hat{y}_f = \operatorname{softmax}\!\left( Z\!\left( \left[ h_f ;\, e(\hat{y}_{f+1}) \right] \right) \right) \qquad (3)$$

where $Z(\cdot)$ denotes an affine transformation. At frame $F$, formula (3) requires the F0 embedding for time $F+1$, but the label at $F+1$ lies beyond the range of the utterance's labels. To handle this, frame $F+1$ is assigned an EOS (end of sentence) label, which corresponds to the maximum F0 label index, i.e. the last entry of the F0 embedding pool. After the F0 label of frame $F$ is predicted by the above steps, its embedding is looked up in the embedding pool's table, concatenated with the forward hidden-layer output $h_{F-1}$, and fed to the softmax layer according to formula (3) to obtain the F0 label at frame $F-1$; this iterates backwards through frames $F-2$, $F-3$, and so on, down to frame 1.
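A sketch of this backward, last-frame-first decoding pass with the embedding pool; the label count N, vector dimensions, and random parameters are placeholder assumptions, and the last pool row is reserved for EOS as the text describes.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

N, d_h, d_e, F = 64, 128, 32, 100       # pitch labels, dims, frames (assumed)
rng = np.random.default_rng(1)
embed_pool = rng.standard_normal((N + 1, d_e))     # last row = EOS embedding
Z_W = rng.standard_normal((N, d_h + d_e)) * 0.01   # affine transform Z(.)
Z_b = np.zeros(N)
h = rng.standard_normal((F, d_h))       # encoder hidden states h_1..h_F

labels = np.zeros(F, dtype=int)
prev = N                                # start from the EOS label at frame F+1
for f in range(F - 1, -1, -1):          # decode from the last frame backwards
    joint = np.concatenate([h[f], embed_pool[prev]])   # [h_f ; e(y_{f+1})]
    labels[f] = int(np.argmax(softmax(Z_W @ joint + Z_b)))
    prev = labels[f]
```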

When training the above F0 extraction network, the teacher forcing method is used to improve convergence speed and training quality. Teacher forcing means that during training, formula (4) replaces formula (3): the actually annotated F0 label $y_{f+1}$ of frame $f+1$ is converted into its embedding $e(y_{f+1})$ and used, together with the hidden state $h_f$ of frame $f$, to predict the F0 label at frame $f$:

$$\hat{y}_f = \operatorname{softmax}\!\left( Z\!\left( \left[ h_f ;\, e(y_{f+1}) \right] \right) \right) \qquad (4)$$

During implementation, direct training and teacher forcing are selected at random, with the random coefficient set from experience; in this example it is set to 0.5. In the training phase, given the waveform input of the whole utterance, the network outputs the posterior probability of the F0 label at every frame $f$:

$$P\!\left( \hat{y}_f \mid x_1, \dots, x_F \right), \quad f = 1, \dots, F \qquad (5)$$

Using the F0 label of each frame, cross-entropy is taken as the objective function, and the network parameters are optimized with the backpropagation algorithm using stochastic gradient descent.
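A sketch of one training step: the random choice between free running (formula (3)) and teacher forcing (formula (4)) with the 0.5 coefficient stated above, plus the per-frame cross-entropy objective. The helper name, tensor shapes, and label count are assumptions.

```python
import random
import torch
import torch.nn.functional as F_nn

TEACHER_FORCING_RATE = 0.5              # empirically set random coefficient

def decoder_input_label(f, gold, pred):
    """Label fed to the decoder at frame f (formula (4) vs formula (3))."""
    if random.random() < TEACHER_FORCING_RATE:
        return gold[f + 1]              # teacher forcing: annotated label
    return pred[f + 1]                  # free running: model's own prediction

# Per-frame cross-entropy over F0 labels, optimized by backpropagation/SGD.
logits = torch.randn(100, 64, requires_grad=True)   # (F frames, N labels)
gold_labels = torch.randint(0, 64, (100,))
loss = F_nn.cross_entropy(logits, gold_labels)
loss.backward()                         # gradients for the SGD update
```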

When extracting F0 features with the trained F0 extraction neural network, the network parameters are held fixed; the raw audio waveform or its normalized cross-correlation coefficients are input to predict the F0 label of each frame, and the actual F0 prediction is obtained from the label-to-frequency correspondence.

(4) Tone decoding and recognition network

The tone decoding and recognition network, shown in Figure 4, is a tone-label-dependent forward decoding network. It comprises two parts: a syllable tone embedding representation network and a context-dependent tone decoding network. The syllable tone embedding representation network converts the variable-length F0 sequence of each syllable, delimited by the syllable's boundary information, into a fixed-dimension vector. Concretely, the output of the F0 extraction network is taken out according to each syllable's boundary information, each frame is spliced with a context of 9 frames before and after it, and the result is sent into the syllable tone embedding representation network to obtain a fixed-dimension embedding for each syllable.
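A sketch of this splice-and-pool step mapping one syllable's variable-length F0 segment to a fixed-dimension vector; mean pooling stands in for the embedding network itself, and the edge-padding behavior is an assumption.

```python
import numpy as np

def syllable_embedding(f0_seq: np.ndarray, start: int, end: int,
                       context: int = 9) -> np.ndarray:
    """Map frames [start, end) of an F0 sequence to one fixed-dim vector."""
    spliced = []
    for f in range(start, end):
        lo, hi = max(0, f - context), min(len(f0_seq), f + context + 1)
        window = f0_seq[lo:hi]
        pad = 2 * context + 1 - len(window)
        window = np.pad(window, (0, pad))      # zero-pad at utterance edges
        spliced.append(window)                 # frame plus +/- 9-frame context
    return np.stack(spliced).mean(axis=0)      # fixed dim: 2 * context + 1

f0 = np.random.rand(300)                       # predicted per-frame F0 values
vec = syllable_embedding(f0, start=40, end=65) # one syllable's boundaries
print(vec.shape)                               # (19,)
```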

The context-dependent tone decoding network predicts the tone label of each syllable in the utterance from the F0 values obtained either by traditional F0 extraction or by the F0 extraction neural network. The tone embedding pool contains vector representations for 6 tone labels: 5 embedding vectors correspond to the 5 Mandarin tones, and the remaining one represents the sentence-initial tone. The tone embedding pool is likewise optimized during backpropagation.

When predicting the tone type of the sentence-initial (1st) syllable, the sentence-initial tone embedding vector is concatenated with the embedding representation of the current syllable and fed into the tone classification network, which predicts the tone type of the 1st syllable. The tone embedding vector corresponding to the predicted 1st tone type is then concatenated with the embedding representation of the 2nd syllable and fed into the tone prediction network to obtain the tone label of the 2nd syllable. The 3rd syllable is predicted next, and so on until the last syllable.
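A sketch of this left-to-right, label-dependent decoding loop; the 6-entry tone embedding pool (5 tones plus the sentence-initial embedding) follows the text, while the dimensions and the single linear-plus-softmax classifier are placeholders for the actual tone classification network.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

NUM_TONES, d_s, d_t = 5, 19, 8
rng = np.random.default_rng(2)
tone_pool = rng.standard_normal((NUM_TONES + 1, d_t))  # index 5 = BOS
W = rng.standard_normal((NUM_TONES, d_s + d_t)) * 0.1

syllables = rng.standard_normal((4, d_s))   # fixed-dim syllable embeddings
prev = NUM_TONES                            # start from the BOS embedding
tones = []
for s in syllables:                         # predict syllable tones in order
    joint = np.concatenate([s, tone_pool[prev]])
    prev = int(np.argmax(softmax(W @ joint)))
    tones.append(prev + 1)                  # report tone types 1..5
print(tones)
```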

(5) End-to-end tone recognition neural network model

The end-to-end tone recognition neural network model connects the F0 extraction network and the tone classification network into one overall tone recognition network whose parameters are optimized simultaneously. The F0 labels of the 9 frames before and after the current frame output by the F0 extraction network are converted, by lookup in the pool of N F0 embeddings, into 9 frames of F0 embeddings that serve as the input of the tone model. Alternatively, the weighted F0 values can be spliced over 9 frames as the back-end input. Tuning the parameters of the model as a whole yields better tone recognition results than tuning only part of them.

In summary, the invention improves the accuracy of Mandarin tone recognition, reduces computation time, breaks the traditional framework in which tone recognition is divided into two stages (F0 feature extraction and tone classification), and constructs an end-to-end tone model. The end-to-end model jointly optimizes the feature extraction stage and the classification stage as a single network, thereby improving the accuracy of Mandarin Chinese tone recognition; the network model is robust and is suitable for tone research in tonal languages.

In the embodiments of the present invention, the models of the devices are not limited unless otherwise specified; any device capable of performing the above functions may be used.

Those skilled in the art will understand that the accompanying drawings are only schematic diagrams of a preferred embodiment, and that the serial numbers of the above embodiments are for description only and do not indicate relative merit.

The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (3)

1. An end-to-end tone recognition method based on a neural network, the method comprising:
firstly, training a tone recognition system model:
constructing an end-to-end tone recognition model, and determining the required parameters, such as the number of layers of the neural network and the number of hidden-layer nodes;
training a voice recognition acoustic model on a training set, and acquiring the starting time and the ending time of each syllable by using forced alignment;
sending the selected training voice data and the tone label of each syllable into an end-to-end tone recognition model for training optimization to obtain optimized neural network model parameters;
continuously adjusting parameters of the neural network model, and selecting optimal parameters of the network model;
secondly, tone recognition:
obtaining a test voice, and obtaining the starting time and the ending time of each syllable by using forced alignment under the condition of a given sentence content text; when not given, obtaining the start and end time of each syllable using automatic speech recognition;
sending the selected test voice data into the end-to-end tone recognition model for recognition, and finally obtaining the tone type of each syllable in each test utterance;
the method comprises the steps of constructing a trainable deep neural network model, combining a fundamental frequency extraction neural network with a tone decoding neural network to form an end-to-end neural network tone classification model, and training and optimizing network parameters of the two parts at the training stage;
the fundamental frequency extraction neural network is an encoder-decoder based on a cyclic neural network, and the network is divided into a fundamental frequency encoder network and a fundamental frequency decoder network;
the base frequency encoder network encodes the speech with a recurrent neural network; the base frequency decoder network predicts base frequency labels starting from the last frame of the speech, converts each predicted base frequency label into a trainable base frequency embedding vector, and determines the base frequency label at the current time by taking the base frequency label at the next time and the encoder hidden state at the current time as the joint input of the base frequency decoder, until the base frequency label of the first frame is predicted;
after the base frequency label of each frame has been predicted, the predefined mapping between labels and base frequency values is used to convert the labels into the base frequency value sequence of the whole utterance.
2. The end-to-end tone recognition method based on the neural network as claimed in claim 1, wherein the tone decoding neural network is divided into two parts: a tone representation network and a label-dependent tone classification network;
the tone representation network maps the predicted fundamental frequency value sequence into a vector with fixed dimensionality according to each syllable;
the tone classification network predicts the tone type of the current syllable based on the tone label predicted for the last syllable and the fixed dimensional vector of the current syllable.
3. The end-to-end tone recognition method based on neural network as claimed in claim 2, wherein the tone classification network predicts the tone type of the current syllable according to the tone label predicted by the previous syllable and the fixed dimension vector of the current syllable, specifically:
firstly, the fixed dimension representation of the 1st syllable is spliced with the fixed dimension embedding corresponding to the beginning of a sentence, and the spliced representation is sent into a tone classification network to predict the tone type of the 1st syllable;
converting the predicted tone type of the 1 st syllable into a corresponding tone label and then sending the corresponding tone label and the fixed dimension representation of the 2 nd syllable as combined input into a tone classification network to obtain the tone type of the 2 nd syllable;
this is repeated until the tone of the last syllable is predicted.
CN201911310349.4A 2019-12-18 2019-12-18 End-to-end tone recognition method based on neural network Expired - Fee Related CN111063335B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911310349.4A CN111063335B (en) 2019-12-18 2019-12-18 End-to-end tone recognition method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911310349.4A CN111063335B (en) 2019-12-18 2019-12-18 End-to-end tone recognition method based on neural network

Publications (2)

Publication Number Publication Date
CN111063335A CN111063335A (en) 2020-04-24
CN111063335B true CN111063335B (en) 2022-08-09

Family

ID=70302281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911310349.4A Expired - Fee Related CN111063335B (en) 2019-12-18 2019-12-18 End-to-end tone recognition method based on neural network

Country Status (1)

Country Link
CN (1) CN111063335B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539268A (en) * 2021-01-29 2021-10-22 南京迪港科技有限责任公司 End-to-end voice-to-text rare word optimization method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102938252A (en) * 2012-11-23 2013-02-20 中国科学院自动化研究所 System and method for recognizing Chinese tone based on rhythm and phonetics features
CN103489446A (en) * 2013-10-10 2014-01-01 福州大学 Twitter identification method based on self-adaption energy detection under complex environment
CN107492373A (en) * 2017-10-11 2017-12-19 河南理工大学 The Tone recognition method of feature based fusion

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8756061B2 (en) * 2011-04-01 2014-06-17 Sony Computer Entertainment Inc. Speech syllable/vowel/phone boundary detection using auditory attention cues
US20190130896A1 (en) * 2017-10-26 2019-05-02 Salesforce.Com, Inc. Regularization Techniques for End-To-End Speech Recognition

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102938252A (en) * 2012-11-23 2013-02-20 中国科学院自动化研究所 System and method for recognizing Chinese tone based on rhythm and phonetics features
CN103489446A (en) * 2013-10-10 2014-01-01 福州大学 Twitter identification method based on self-adaption energy detection under complex environment
CN107492373A (en) * 2017-10-11 2017-12-19 河南理工大学 The Tone recognition method of feature based fusion

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A convolutional representation for pitch estimation; Kim J; Proc. of ICASSP; 20181231; full text *
Mandarin Tone Modeling Using Recurrent Neural Networks; Hao Huang; arXiv; 20171106; full text *
Mandarin Chinese tone recognition based on BP networks; Li Shiqiang (李仕强); Journal of Nanjing University of Information Science and Technology (Natural Science Edition); 20120512; pp. 456-459 *
Research on the application of neural networks in speech tone recognition; Zhang Zhenguo (张振国); Microelectronics & Computer; 20050312 (No. 3); pp. 43-49 *

Also Published As

Publication number Publication date
CN111063335A (en) 2020-04-24

Similar Documents

Publication Publication Date Title
CN112863483B (en) Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
CN112802448B (en) Speech synthesis method and system for generating new tone
CN111739508A (en) An end-to-end speech synthesis method and system based on DNN-HMM bimodal alignment network
CN113205792A (en) Mongolian speech synthesis method based on Transformer and WaveNet
CN102938252B (en) System and method for recognizing Chinese tone based on rhythm and phonetics features
CN113257221B (en) Voice model training method based on front-end design and voice synthesis method
CN112967720B (en) End-to-end voice-to-text model optimization method under small amount of accent data
CN112466316A (en) Zero-sample voice conversion system based on generation countermeasure network
CN113539268A (en) End-to-end voice-to-text rare word optimization method
CN112735404A (en) Ironic detection method, system, terminal device and storage medium
CN104538025A (en) Method and device for converting gestures to Chinese and Tibetan bilingual voices
CN111951781A (en) A method for Chinese prosodic boundary prediction based on graph-to-sequence
CN117236335A (en) A two-stage named entity recognition method based on hint learning
CN111063335B (en) End-to-end tone recognition method based on neural network
CN110188342A (en) A Spoken Language Comprehension Method Based on Knowledge Graph and Semantic Graph Technology
CN112863485A (en) Accent voice recognition method, apparatus, device and storage medium
Li et al. Improving mandarin tone mispronunciation detection for non-native learners with soft-target tone labels and blstm-based deep models
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
Mamyrbayev et al. Modern trends in the development of speech recognition systems
CN116682413A (en) A Mongolian Speech Synthesis Method Based on Conformer and MelGAN
CN117079637A (en) Mongolian emotion voice synthesis method based on condition generation countermeasure network
CN114446278A (en) Speech synthesis method and apparatus, device and storage medium
CN115762471A (en) Voice synthesis method, device, equipment and storage medium
CN114999448A (en) Personalized voice synthesis method and system
CN113434669A (en) Natural language relation extraction method based on sequence marking strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220809