
CN110047468B - Speech recognition method, apparatus and storage medium - Google Patents

Speech recognition method, apparatus and storage medium

Info

Publication number
CN110047468B
Authority
CN
China
Prior art keywords
feature
fusion
intermediate feature
speech recognition
processing
Prior art date
Legal status
Active
Application number
CN201910418620.XA
Other languages
Chinese (zh)
Other versions
CN110047468A (en)
Inventor
曲贺
王晓瑞
李岩
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201910418620.XA
Publication of CN110047468A
Application granted
Publication of CN110047468B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a speech recognition method, apparatus and storage medium, and belongs to the technical field of machine learning. The method includes: acquiring an audio frame to be recognized; extracting the Mel-scale filter bank feature and the sounding user information vector of the audio frame respectively; fusing the Mel-scale filter bank feature and the sounding user information vector to obtain a fusion feature; and processing the fusion feature based on a target acoustic model to obtain a speech recognition result of the audio frame, where the target acoustic model includes a plurality of hole convolution layers. The method extracts the Mel-scale filter bank feature and the sounding user information vector of the audio frame simultaneously, fuses the two, and inputs the fused feature into the acoustic model; because the fused feature effectively expresses speaker characteristics and channel characteristics, the accuracy of speech recognition is improved. In addition, because the acoustic model includes a plurality of hole convolution layers, the amount of computation is reduced for the same receptive field, which speeds up speech recognition.

Description

Speech recognition method, apparatus and storage medium
Technical Field
The present disclosure relates to the field of machine learning technologies, and in particular, to a speech recognition method, apparatus, and storage medium.
Background
Speech recognition, also known as Automatic Speech Recognition (ASR), is a technology that enables a machine to convert a speech signal into corresponding text or commands through recognition and understanding. Speech recognition technology is widely used in many fields, such as industry, home appliances, communications, automotive electronics, medical care, home services and consumer electronics.
In the speech recognition process, the accuracy and speed of speech recognition are crucial. It is well known that the higher the accuracy and speed of speech recognition, the higher the user's satisfaction. Therefore, how to perform speech recognition accurately and quickly so as to improve the speech recognition effect has become a problem to be solved by those skilled in the art.
Disclosure of Invention
The present disclosure provides a voice recognition method, apparatus, and storage medium, which can effectively improve a voice recognition effect.
According to a first aspect of the embodiments of the present disclosure, there is provided a speech recognition method, including:
acquiring an audio frame to be identified;
respectively extracting the Mel scale filter bank characteristics and the sound production user information vectors of the audio frames;
fusing the Mel scale filter bank characteristics and the sounding user information vector to obtain fused characteristics;
and processing the fusion characteristics based on a target acoustic model to obtain a voice recognition result of the audio frame, wherein the target acoustic model comprises a plurality of void convolution layers.
In a possible implementation manner, the fusing the mel-scale filter bank features and the utterance user information vector includes:
normalizing the Mel scale filter bank characteristics to obtain a first intermediate characteristic;
performing dimension transformation processing on the sound-producing user information vector to obtain a second intermediate feature, wherein the dimension of the second intermediate feature is larger than that of the sound-producing user information vector;
normalizing the second intermediate feature to obtain a third intermediate feature;
and performing fusion processing on the first intermediate feature and the third intermediate feature to obtain the fusion feature.
In a possible implementation manner, the normalizing the mel-scale filter bank feature to obtain a first intermediate feature includes:
normalizing the Mel-scale filter bank features to a mean of 0 and a variance of 1 based on a first BatchNorm (batch normalization) layer, resulting in the first intermediate features;
the normalizing the second intermediate feature to obtain a third intermediate feature includes:
normalizing the second intermediate features to have a mean of 0 and a variance of 1 based on a second BatchNorm layer to obtain the third intermediate features.
In one possible implementation, the target acoustic model includes a hole convolutional neural network and an LSTM (Long Short-Term Memory) network, the hole convolutional neural network including the plurality of hole convolutional layers, the LSTM network including a plurality of LSTM layers;
the processing the fusion feature based on the target acoustic model to obtain the voice recognition result of the audio frame includes:
inputting the fusion characteristics into the cavity convolution neural network, and processing the fusion characteristics through the plurality of cavity convolution layers in sequence, wherein the output of the previous cavity convolution layer is the input of the next cavity convolution layer;
taking a first output result of the last cavity convolution layer as the input of the LSTM network, and sequentially processing the first output result through the plurality of LSTM layers, wherein the output of the previous LSTM layer is the input of the next LSTM layer;
determining the speech recognition result based on the second output result of the last LSTM layer.
In a possible implementation manner, the performing the fusion processing on the first intermediate feature and the third intermediate feature to obtain the fused feature includes:
performing column exchange processing on the first intermediate feature and the third intermediate feature to obtain the fusion feature; or,
performing weighted transformation processing on the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fusion feature.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech recognition apparatus including:
an acquisition unit configured to acquire an audio frame to be recognized;
an extraction unit configured to extract a Mel-scale filter bank feature and an utterance user information vector of the audio frame, respectively;
the fusion unit is configured to perform fusion processing on the Mel scale filter bank characteristics and the sounding user information vectors to obtain fusion characteristics;
a processing unit configured to process the fusion features based on a target acoustic model, so as to obtain a speech recognition result of the audio frame, where the target acoustic model includes a plurality of void convolution layers.
In one possible implementation manner, the fusion unit includes:
a first processing subunit, configured to perform normalization processing on the mel-scale filter bank features to obtain first intermediate features;
the second processing subunit is configured to perform dimension transformation processing on the sound-producing user information vector to obtain a second intermediate feature, and the dimension of the second intermediate feature is larger than that of the sound-producing user information vector;
a third processing subunit, configured to perform normalization processing on the second intermediate feature to obtain a third intermediate feature;
a fusion subunit configured to perform fusion processing on the first intermediate feature and the third intermediate feature to obtain the fusion feature.
In one possible implementation, the first processing subunit is further configured to normalize the mel-scale filter bank features to a mean of 0 and a variance of 1 based on a first BatchNorm layer, resulting in the first intermediate features;
the third processing subunit is configured to normalize the second intermediate feature to a mean of 0 and a variance of 1 based on a second BatchNorm layer, resulting in the third intermediate feature.
In one possible implementation, the target acoustic model includes a hole convolutional neural network and an LSTM network, the hole convolutional neural network including the plurality of hole convolutional layers, the LSTM network including a plurality of LSTM layers;
the processing unit is further configured to input the fusion feature into the cavity convolutional neural network, and process the fusion feature sequentially through the plurality of cavity convolutional layers, where an output of a previous cavity convolutional layer is an input of a next cavity convolutional layer; taking a first output result of the last cavity convolution layer as the input of the LSTM network, and sequentially processing the first output result through the plurality of LSTM layers, wherein the output of the previous LSTM layer is the input of the next LSTM layer; determining the speech recognition result based on the second output result of the last LSTM layer.
In a possible implementation manner, the merging subunit is further configured to perform column exchange processing on the first intermediate feature and the third intermediate feature to obtain the merged feature; or, performing weighted transformation processing on the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fusion feature.
According to a third aspect of the embodiments of the present disclosure, there is provided a speech recognition apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to: the speech recognition method according to the first aspect is performed.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium, wherein instructions, when executed by a processor of a speech recognition apparatus, enable the speech recognition apparatus to perform the speech recognition method of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided an application program, wherein instructions that, when executed by a processor of a speech recognition apparatus, enable the speech recognition apparatus to perform the speech recognition method of the first aspect.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
in the voice recognition process, the method can simultaneously extract the Mel scale filter bank characteristics and the sound production user information vector of the audio frame, then, the characteristics of the two types are fused, the fused characteristics are used as acoustic characteristics to be input into an acoustic model for voice recognition, and the fused characteristics can effectively express the characteristics of a speaker and the characteristics of a channel, so that the voice recognition mode improves the accuracy of the voice recognition; in addition, the acoustic model comprises a plurality of hole convolution layers, and the calculation amount can be effectively reduced under the same receptive field by utilizing the hole characteristics of the hole convolution layers, so that the speed of voice recognition is increased, namely, the voice recognition method provided by the embodiment of the disclosure can effectively improve the voice recognition effect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic diagram illustrating a hole convolution according to an exemplary embodiment.
Fig. 2 is a schematic structural diagram illustrating an implementation environment related to a speech recognition method according to an exemplary embodiment.
FIG. 3 is a flow diagram illustrating a method of speech recognition according to an example embodiment.
FIG. 4 is a flow diagram illustrating a method of speech recognition according to an example embodiment.
FIG. 5 is a schematic diagram illustrating a multi-feature fusion process in accordance with an exemplary embodiment.
FIG. 6 is a schematic diagram illustrating the structure of an acoustic model in accordance with an exemplary embodiment.
FIG. 7 is a block diagram illustrating a speech recognition apparatus according to an example embodiment.
FIG. 8 is a block diagram illustrating a speech recognition apparatus according to an example embodiment.
FIG. 9 is a block diagram illustrating a speech recognition apparatus according to an example embodiment.
FIG. 10 is a block diagram illustrating a speech recognition apparatus according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Before explaining the embodiments of the present disclosure in detail, some terms related to the embodiments of the present disclosure are explained.
Mel-scale filter bank characteristics: in the disclosed embodiments, the mel-scale filter bank features refer to FilterBank features, also known as FBank features. The FilterBank algorithm is a front-end processing algorithm, and is used for processing audio in a manner similar to that of human ears, so that the voice recognition performance can be improved.
Sounding user information vector: in the disclosed embodiments, the sounding (voiced) user information vector refers to the i-vector feature. The i-vector feature contains not only speaker difference information but also channel difference information; in other words, the i-vector feature can effectively represent speaker characteristics and channel characteristics, i.e., it is used to characterize the speaker and the channel.
Hole Convolution (Dilated Convolution): also known as dilated or atrous convolution, it can enlarge the receptive field. In a convolutional neural network, the receptive field is the size of the region of the input layer that corresponds to one element in the output result of a given layer. Expressed in mathematical language, the receptive field is the mapping of one element of the output result of a layer in the convolutional neural network back onto the input layer.
Referring to fig. 1, the hole convolution operations with dilation rates of 1, 2 and 3 are shown, respectively. The left diagram of FIG. 1 corresponds to a hole convolution with a 3x3 convolution kernel and a dilation rate of 1; this is the same as an ordinary convolution operation. The middle diagram of fig. 1 corresponds to a hole convolution with a 3x3 convolution kernel and a dilation rate of 2: the actual convolution kernel size is still 3x3, but holes of size 1 are inserted, that is, for a 7x7 feature region, only the features at the 9 black squares are convolved with the 3x3 convolution kernel, and the rest are skipped. It can also be understood as a 7x7 convolution kernel in which the weights at the 9 black squares are non-zero and the remaining weights are 0. As can be seen from the middle diagram, although the convolution kernel size is only 3x3, the receptive field of this convolution has increased to 7x7. The right diagram of FIG. 1 corresponds to a hole convolution with a 3x3 convolution kernel and a dilation rate of 3.
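As an illustrative calculation (a sketch added for this description, not part of the patent; it assumes the receptive fields in FIG. 1 accumulate over stacked layers, as in the standard dilated-convolution literature), the receptive field of stacked 3x3 hole convolutions and the unchanged number of kernel weights can be checked as follows.

import torch
import torch.nn as nn

def stacked_receptive_field(kernel, dilations):
    """Receptive field of a stack of dilated convolutions: it grows by
    dilation * (kernel - 1) with each additional layer."""
    rf = 1
    for d in dilations:
        rf += d * (kernel - 1)
    return rf

print(stacked_receptive_field(3, [1]))       # 3  -> ordinary 3x3 convolution (left diagram)
print(stacked_receptive_field(3, [1, 2]))    # 7  -> 7x7 after adding a 2-dilated layer
print(stacked_receptive_field(3, [1, 2, 3])) # 13 -> after adding a 3-dilated layer

# A single dilated Conv2d still has only 3 x 3 = 9 weights, whatever the dilation rate.
conv = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2, bias=False)
print(conv.weight.numel())                   # 9
print(conv(torch.randn(1, 1, 32, 32)).shape) # spatial size preserved by the matching padding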
Batch normalization layer: in the disclosed embodiments, the batch normalization layer refers to the BatchNorm layer. The role of the BatchNorm layer is to transform the distribution of the input data into a standard normal distribution with a mean of 0 and a variance of 1 through a normalization step.
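As a rough illustration of this normalization (a sketch for this description, not code from the patent; the gamma and beta scale-and-shift parameters are the usual learnable BatchNorm parameters and are assumptions here), the per-dimension transformation can be written as follows.

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature dimension of a (batch, dim) array to mean 0 and variance 1,
    then apply the learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.randn(32, 40) * 5.0 + 3.0   # e.g. 32 frames of 40-dimensional features
normalized = batch_norm(batch)
print(normalized.mean(axis=0)[:3].round(3), normalized.std(axis=0)[:3].round(3))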
The following describes an implementation environment related to a speech recognition method provided by an embodiment of the present disclosure.
The voice recognition method provided by the embodiment of the disclosure is applied to voice recognition equipment. Referring to fig. 2, the speech recognition device 201 is a computer device with machine learning capability, for example, the computer device may be a stationary computer device such as a personal computer and a server, and may also be a mobile computer device such as a tablet computer and a smart phone, which is not particularly limited in this disclosure. The speech recognition device 201 includes a feature extraction module that performs front-end processing, and an acoustic model that performs back-end processing.
It is well known that in speech recognition, the accuracy and speed of recognition are of paramount importance. For front-end processing, the related art usually extracts MFCC (Mel Frequency Cepstrum Coefficient) features, but MFCC features cannot effectively characterize speaker differences and channel differences. For back-end processing, the related art uses a CNN (Convolutional Neural Network) as the acoustic model. A CNN is a feedforward neural network whose time-dimension convolution kernels are computed by expanding along the time dimension, that is, they depend on historical audio frames and future audio frames; the more left and right audio frames are depended on, the larger the receptive field of the convolutional neural network. When multiple convolutional layers are stacked, more left and right audio frames are depended on, the amount of computation grows, and the speech recognition speed drops.
Therefore, on one hand, the voice recognition method performs multi-feature fusion, namely, the FBank feature and the i-vector feature are simultaneously extracted during feature extraction, and the FBank feature and the i-vector feature are fused to be used as the input of voice recognition, so that the accuracy of voice recognition is improved.
On the other hand, the speech recognition method uses a new deep neural network as the acoustic model. The acoustic model includes a hole convolutional neural network and an LSTM network, that is, a dilated-CNN + LSTM. Compared with a plain CNN, fewer convolution kernels are needed for the same receptive field, which effectively reduces the amount of computation and increases the speed of speech recognition.
In addition, under the same calculation amount, the cavity convolution has a larger reception field compared with the CNN, so that more information can be captured.
In addition, the larger the dilation rate, the larger the receptive field and the more information can be captured, so using hole convolution can also improve the accuracy of speech recognition.
Fig. 3 is a flow chart illustrating a speech recognition method according to an exemplary embodiment, which is used in the speech recognition apparatus shown in fig. 2, as shown in fig. 3, and includes the following steps.
In step 301, an audio frame to be identified is acquired.
In step 302, mel-scale filter bank features and voiced user information vectors of the audio frame are extracted, respectively.
In step 303, the mel-scale filter bank features and the voice user information vector are fused to obtain fused features.
In step 304, the fusion features are processed based on a target acoustic model, which includes a plurality of hole convolution layers, to obtain a speech recognition result of the audio frame.
According to the method provided by the embodiment of the disclosure, in the voice recognition process, the Mel scale filter bank characteristics and the voice user information vector of the audio frame are extracted simultaneously, then the characteristics of the two types are fused, the fused characteristics are input into an acoustic model as acoustic characteristics for voice recognition, and the fused characteristics can effectively express the characteristics of a speaker and the characteristics of a channel, so that the voice recognition mode improves the accuracy of the voice recognition; in addition, the acoustic model comprises a plurality of hole convolution layers, and the calculation amount can be effectively reduced under the same receptive field by utilizing the hole characteristics of the hole convolution layers, so that the speed of voice recognition is increased, namely, the voice recognition method provided by the embodiment of the disclosure can effectively improve the voice recognition effect.
In a possible implementation manner, the fusing the mel-scale filter bank features and the utterance user information vector includes:
normalizing the Mel scale filter bank characteristics to obtain a first intermediate characteristic;
performing dimension transformation processing on the sound-producing user information vector to obtain a second intermediate feature, wherein the dimension of the second intermediate feature is larger than that of the sound-producing user information vector;
normalizing the second intermediate feature to obtain a third intermediate feature;
and performing fusion processing on the first intermediate feature and the third intermediate feature to obtain the fusion feature.
In a possible implementation manner, the normalizing the mel-scale filter bank feature to obtain a first intermediate feature includes:
normalizing the Mel scale filter bank features to have a mean value of 0 and a variance of 1 based on a first batch of normalized BatchNorm layers to obtain the first intermediate features;
the normalizing the second intermediate feature to obtain a third intermediate feature includes:
normalizing the second intermediate features to have a mean of 0 and a variance of 1 based on a second BatchNorm layer to obtain the third intermediate features.
In one possible implementation, the target acoustic model includes a hole convolutional neural network and an LSTM network, the hole convolutional neural network including the plurality of hole convolutional layers, the LSTM network including a plurality of LSTM layers;
the processing the fusion feature based on the target acoustic model to obtain the voice recognition result of the audio frame includes:
inputting the fusion characteristics into the cavity convolution neural network, and processing the fusion characteristics through the plurality of cavity convolution layers in sequence, wherein the output of the previous cavity convolution layer is the input of the next cavity convolution layer;
taking a first output result of the last cavity convolution layer as the input of the LSTM network, and sequentially processing the first output result through the plurality of LSTM layers, wherein the output of the previous LSTM layer is the input of the next LSTM layer;
determining the speech recognition result based on the second output result of the last LSTM layer.
In a possible implementation manner, the performing the fusion processing on the first intermediate feature and the third intermediate feature to obtain the fused feature includes:
performing column exchange processing on the first intermediate feature and the third intermediate feature to obtain the fusion feature; or,
performing weighted transformation processing on the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fusion feature.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
It should be noted that descriptions like the first, second, third, fourth, and the like appearing in the following embodiments are only for distinguishing different objects, and do not constitute any other special limitation on the respective objects.
Fig. 4 is a flow chart illustrating a speech recognition method, as shown in fig. 4, for use in the speech recognition device shown in fig. 2, according to an exemplary embodiment, including the following steps.
In step 401, an audio frame to be identified is acquired.
An audio frame usually refers to a small segment of audio of fixed length. As an example, in speech recognition the frame length is usually set to 10 to 30 ms (milliseconds), i.e., the playing duration of an audio frame is 10 to 30 ms, so that each frame contains enough signal periods while the signal within a frame does not change too drastically. In the disclosed embodiment, the playing duration of an audio frame is 25 ms, i.e., the frame length is 25 ms, and the frame shift is 10 ms.
In one possible implementation, the speech recognition device typically preprocesses the speaker's speech prior to feature extraction, where the preprocessing includes, but is not limited to, framing, pre-emphasis, windowing, noise reduction, and the like.
In addition, the voice of the speaker may be the voice collected by the voice collecting device configured in the voice recognition device, or the voice sent to the voice recognition device by other devices, which is not specifically limited in this embodiment of the disclosure.
As an example, the speech recognition may be performed on a frame-by-frame basis, or on a plurality of audio frames, and the embodiment of the present disclosure also does not specifically limit this.
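As a concrete illustration of the framing described in this step (a sketch for this description, not code from the patent; the 16 kHz sampling rate is an assumption, since the text does not state one), a waveform can be split into 25 ms frames with a 10 ms shift as follows.

import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a 1-D waveform into overlapping frames (25 ms frame length, 10 ms frame shift)."""
    frame_len = int(sample_rate * frame_ms / 1000)    # 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # 160 samples at 16 kHz
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    return np.stack([signal[i * frame_shift: i * frame_shift + frame_len]
                     for i in range(num_frames)])

wave = np.random.randn(16000)      # one second of dummy audio
print(frame_signal(wave).shape)    # (98, 400): 98 frames of 400 samples each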
In step 402, FBank features and i-vector features of the audio frame to be identified are extracted, respectively.
Extracting FBank features
For FBank features, FBank feature extraction needs to be performed after preprocessing, and at this time, the voice of the speaker is already framed to obtain individual audio frames, that is, FBank features need to be extracted frame by frame. Since the time domain signal is still obtained after the framing processing, in order to extract the FBank feature, the time domain signal needs to be converted into a frequency domain signal first.
In one possible implementation, a Fourier transform may be used to convert the signal from the time domain to the frequency domain. The Fourier transform can be further divided into the continuous Fourier transform and the discrete Fourier transform; since the audio frame is digital audio rather than analog audio, the embodiments of the present disclosure use the discrete Fourier transform to extract FBank features. As an example, FBank feature extraction is typically performed frame by frame using the FFT (Fast Fourier Transform).
In one possible implementation, taking frame-by-frame speech recognition as an example, the dimension of the FBank feature may be 40, which is not specifically limited by the embodiments of the present disclosure.
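To make the above concrete, the following is a simplified sketch of log Mel filter bank (FBank) extraction written for this description; it is not the patent's implementation, and the 512-point FFT, the Hamming window and the 16 kHz sampling rate are assumptions.

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sample_rate=16000):
    """Triangular Mel filters mapping an FFT power spectrum to n_filters Mel bands."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def fbank_features(frames, n_fft=512, n_filters=40, sample_rate=16000):
    """Per-frame 40-dimensional log Mel filter bank (FBank) features from framed audio."""
    window = np.hamming(frames.shape[1])
    power = np.abs(np.fft.rfft(frames * window, n=n_fft)) ** 2
    mel_energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    return np.log(mel_energies + 1e-10)

frames = np.random.randn(98, 400)     # framed audio, e.g. from the framing sketch above
print(fbank_features(frames).shape)   # (98, 40)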
Extracting i-vector features
The JFA (Joint Factor Analysis) method models speaker differences and channel differences separately in subspaces of the GMM (Gaussian Mixture Model) supervector space, so as to remove channel interference. However, in the JFA model the channel factor also carries part of the speaker information, and part of the speaker information is lost when channel compensation is performed. On this basis, a global difference space model was proposed, which models speaker differences and channel differences as a whole; this method alleviates JFA's heavy requirements on training corpora and its high computational complexity, while achieving performance comparable to JFA.
Given a segment of speech of a speaker, the corresponding gaussian mean supervector can be defined as follows:
M=m+Tw
where M is the Gaussian mean supervector of the given speech; m is the Gaussian mean supervector of the UBM (Universal Background Model), which is independent of the specific speaker and channel; T is the global difference space matrix, which is of low rank; and w is the global difference space factor, whose posterior mean is the i-vector feature and whose prior follows a standard normal distribution.
In the above formula, M and m can be calculated, while the global difference space matrix T and the global difference space factor w need to be estimated. The global difference space matrix T treats all given utterances as coming from different speakers; even multiple utterances of the same speaker are treated as coming from different speakers. The i-vector feature is defined as the maximum a posteriori point estimate of the global difference space factor w, i.e., the posterior mean of w. In one possible implementation, after the global difference space matrix T is estimated, the estimate of the i-vector feature can be computed by extracting the zero-order and first-order Baum-Welch statistics of the given speaker's speech.
In a possible implementation manner, taking frame-by-frame speech recognition as an example, the dimension of the i-vector feature may be 100 dimensions, which is not specifically limited in the embodiment of the present disclosure.
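For reference, the standard point estimate of the i-vector used in the literature (the posterior mean of w) can be sketched as follows. This is background material rather than code disclosed in the patent, and the symbols T (the global difference space matrix), Sigma_inv (inverse UBM covariances), N_c (zero-order Baum-Welch statistics) and F_centered (centered first-order statistics) are assumptions of the sketch.

import numpy as np

def estimate_ivector(T, Sigma_inv, N_c, F_centered):
    """
    Posterior mean of the global difference space factor w (the i-vector), given:
      T          : (C*D, R) global difference (total variability) space matrix
      Sigma_inv  : (C*D,) inverse of the stacked diagonal UBM covariances
      N_c        : (C,) zero-order Baum-Welch statistics per UBM component
      F_centered : (C*D,) stacked centered first-order statistics, F_c - N_c * m_c
    """
    C = N_c.shape[0]
    D = T.shape[0] // C
    R = T.shape[1]
    N_expanded = np.repeat(N_c, D)   # weight each dimension of component c by N_c[c]
    # Posterior precision: L = I + T^T Sigma^-1 N T
    L = np.eye(R) + T.T @ (Sigma_inv[:, None] * N_expanded[:, None] * T)
    # Posterior mean (the i-vector): w = L^-1 T^T Sigma^-1 F_centered
    return np.linalg.solve(L, T.T @ (Sigma_inv * F_centered))

# Toy sizes: 8 UBM components, 13-dimensional features, 100-dimensional i-vector.
C, D, R = 8, 13, 100
w = estimate_ivector(np.random.randn(C * D, R) * 0.01, np.ones(C * D),
                     np.random.rand(C), np.random.randn(C * D))
print(w.shape)   # (100,)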
In step 403, the extracted FBank feature and i-vector feature are fused to obtain a fused feature.
In one possible implementation manner, the fusion processing is performed on the extracted FBank feature and the i-vector feature, and the method includes the following steps:
4031. and carrying out normalization processing on the FBank characteristics to obtain first intermediate characteristics.
Referring to fig. 5, a multi-feature fusion process is shown. For the FBank feature, it is first normalized by a BatchNorm layer. It should be noted that, for the sake of convenience, this BatchNorm layer is referred to as a first BatchNorm layer in the embodiments of the present disclosure.
In one possible implementation, the FBank feature is normalized to obtain a first intermediate feature, which includes but is not limited to: based on the first BatchNorm layer, the FBank features are normalized to a mean of 0 and a variance of 1, resulting in a first intermediate feature.
As shown in fig. 5, the dimension of the FBank feature after passing through the BatchNorm layer is not changed, and is still 40 dimensions. In addition, for ease of reference, the FBank feature passing through the BatchNorm layer is also referred to herein as the first intermediate feature.
4032. And carrying out dimension transformation processing on the i-vector characteristic to obtain a second intermediate characteristic.
As shown in fig. 5, before normalizing the i-vector feature, which carries the speaker characteristics and the channel characteristics, the i-vector feature is first subjected to dimension transformation through a linear mapping (linear) layer. In one possible implementation, the dimension transformation is an up-scaling, i.e., the dimension of the second intermediate feature is larger than the dimension of the original i-vector feature. For example, the i-vector feature is 100-dimensional before passing through the linear layer, and the linear layer maps the 100-dimensional i-vector feature to 200 dimensions.
Wherein the i-vector feature passing through the linear layer is also referred to herein as a second intermediate feature.
4033. And carrying out normalization processing on the second intermediate features to obtain third intermediate features.
Referring to fig. 5, after the linear layer, the i-vector feature passes through a BatchNorm layer, which is referred to as the second BatchNorm layer in the embodiments of the present disclosure for the sake of distinction.
Similarly, the second intermediate features are normalized to obtain third intermediate features, including but not limited to: based on the second BatchNorm layer, the second intermediate features are normalized to have a mean of 0 and a variance of 1, resulting in third intermediate features.
As shown in FIG. 5, the dimension of the i-vector feature after passing through the linear layer and then the BatchNorm layer is unchanged and is still 200. In addition, for ease of reference, the second intermediate feature that has passed through the BatchNorm layer is also referred to herein as the third intermediate feature.
4034. And performing fusion processing on the first intermediate feature and the third intermediate feature to obtain a fusion feature.
Next, referring to fig. 5, the FBank feature and the i-vector feature after passing through the two BatchNorm layers are used as input together, and input to a fusion (combination) layer for fusion processing.
The combination layer may be a linear mapping layer or a column exchange layer, which is not specifically limited in this disclosure. In a possible implementation manner, the parameters of the combine layer may be initialized randomly and optimized based on a back propagation algorithm, for example, the optimization is performed by using a stochastic gradient descent algorithm, which is not specifically limited in the embodiment of the present disclosure.
As an example, the first intermediate feature and the third intermediate feature are subjected to a fusion process to obtain a fused feature, which includes but is not limited to the following two ways:
firstly, when the combination layer is a column exchange layer, column exchange processing is carried out on the first intermediate feature and the third intermediate feature to obtain a fusion feature. Wherein the dimensions of the fused feature are consistent with the dimensions of the first and third intermediate features. As shown in fig. 5, when the first intermediate feature has a dimension of 40 and the third intermediate feature has a dimension of 200, the dimension of the fused feature is 40 × 6. As an example, the column exchange process is used to exchange columns of features, such as exchanging a first column of features with a second column of features, or exchanging a first column of features with a last column of features, and the like, which is not specifically limited in the embodiments of the present disclosure.
Secondly, when the combination layer is a linear layer, the first intermediate feature and the third intermediate feature are subjected to weighted transformation processing based on the weight matrix to obtain a fusion feature. I.e. fusion via the linear layer, which is equivalent to multiplying the first intermediate feature and the third intermediate feature by a weight. The weight matrix may be randomly initialized and obtained by joint training with the acoustic model, which is not specifically limited in the embodiments of the present disclosure.
Fusing the FBank feature and the i-vector feature that have each passed through a BatchNorm layer in this way enhances the robustness of the features.
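The fusion described in steps 4031 to 4034 can be sketched in PyTorch as follows. This is one plausible reading of fig. 5 written for this description, not the patent's code: it assumes the combine layer is a linear layer applied to the concatenated 40-dimensional and 200-dimensional features, keeping the total dimension at 240 (which can be viewed as 40 x 6); the concatenation step and all layer names are assumptions.

import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuse a 40-dim FBank feature with a 100-dim i-vector, roughly as in fig. 5."""
    def __init__(self, fbank_dim=40, ivector_dim=100, mapped_dim=200):
        super().__init__()
        self.fbank_bn = nn.BatchNorm1d(fbank_dim)                 # first BatchNorm layer
        self.ivector_linear = nn.Linear(ivector_dim, mapped_dim)  # linear layer, 100 -> 200
        self.ivector_bn = nn.BatchNorm1d(mapped_dim)              # second BatchNorm layer
        # Assumed combine layer: a linear map over the concatenated features,
        # keeping the total dimension at 40 + 200 = 240 (= 40 x 6).
        self.combine = nn.Linear(fbank_dim + mapped_dim, fbank_dim + mapped_dim)

    def forward(self, fbank, ivector):
        a = self.fbank_bn(fbank)                            # first intermediate feature, (batch, 40)
        b = self.ivector_bn(self.ivector_linear(ivector))   # third intermediate feature, (batch, 200)
        return self.combine(torch.cat([a, b], dim=1))       # fusion feature, (batch, 240)

fusion = FeatureFusion()
fused = fusion(torch.randn(8, 40), torch.randn(8, 100))
print(fused.shape)   # torch.Size([8, 240])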
In step 404, the fusion features are processed based on a target acoustic model to obtain a speech recognition result of the audio frame to be recognized, wherein the target acoustic model includes a hole convolutional neural network and an LSTM network, the hole convolutional neural network includes a plurality of hole convolutional layers, and the LSTM network includes a plurality of LSTM layers.
In the embodiment of the present disclosure, the output of the combine layer in step 403 is the input of the target acoustic model.
In one possible implementation, referring to fig. 6, the hole convolutional neural network contains 6 hole convolution layers in total, referred to as hole convolution layers 0 through 5. The hole convolution is a 2-dimensional convolution with a kernel size of M x N, where M is the size of the convolution in the time domain and N is the size of the convolution in the frequency domain. As an example, the convolution kernels of the 6 hole convolution layers, from layer 0 to layer 5, are as follows:
For hole convolution layer 0, the convolution kernel size is 7x3 and the number of convolution kernels is 64; for layer 1, the kernel size is 5x3 and the number of kernels is 64; for layer 2, the kernel size is 3x3 and the number of kernels is 128; for layer 3, the kernel size is 3x3 and the number of kernels is 128; for layer 4, the kernel size is 3x3 and the number of kernels is 256; for layer 5, the kernel size is 3x3 and the number of kernels is 256.
The LSTM network is an RNN (Recurrent Neural Networks) structure that is widely used in acoustic models at present. Compared with the common RNN, the LSTM controls the storage, input and output of information through a well-designed gate structure, and meanwhile, the problem of gradient disappearance of the common RNN can be avoided to a certain extent, so that the LSTM network can effectively model the long-term correlation of the time sequence signal.
In one possible implementation, the LSTM network in the acoustic model typically contains 3-5 LSTM layers, because stacking more LSTM layers directly to build a deeper network would not bring performance improvement, but would make the model performance worse. As an example, referring to fig. 6, an LSTM network includes 3 LSTM layers.
LSTM differs from an ordinary RNN mainly in that the LSTM adds to the algorithm a "processor" that judges whether information is useful; the structure playing this "processor" role is called a cell, i.e., the LSTM cell. Three gates, namely an input gate, a forget gate and an output gate, are placed in one LSTM cell.
In a possible implementation manner, when training the target acoustic model, a dictionary may be used as the training corpus, an acoustic model with the architecture shown in fig. 6 may be used as the initial model, a monophone, a multi-phone unit (senone), a letter, a word, a Chinese character, or the like may be used as the training target, and a stochastic gradient descent algorithm may be used to optimize the model, which is not specifically limited in the embodiment of the present disclosure.
In the embodiment of the present disclosure, processing the fusion feature based on the target acoustic model to obtain the speech recognition result of the audio frame includes the following steps:
4041. and inputting the fusion characteristics into a cavity convolution neural network, and processing the fusion characteristics through a plurality of cavity convolution layers in sequence, wherein the output of the previous cavity convolution layer is the input of the next cavity convolution layer. As shown in fig. 6, the hole convolution operation is performed on the fusion feature sequentially through the hole convolution layer 0 to the hole convolution layer 5.
4042. And taking the first output result of the last cavity convolution layer as the input of the LSTM network, and sequentially processing the first output result through a plurality of LSTM layers, wherein the output of the previous LSTM layer is the input of the next LSTM layer.
In this document, in order to distinguish the output results of the hole convolutional neural network and the LSTM network, the output result of the hole convolutional neural network is referred to as the first output result, and the output result of the LSTM network is referred to as the second output result. As shown in fig. 6, the output result of the hole convolutional neural network is processed through LSTM layer 0, LSTM layer 1 and LSTM layer 2 in sequence.
4043. And determining a voice recognition result of the audio to be recognized based on the second output result of the last LSTM layer.
That is, the fusion features are input into the multi-layer stacked hole convolutional neural network to effectively learn more abstract features; the first output result is then fed into the multi-layer stacked LSTM network, and finally the output layer outputs the phonetic (acoustic) category corresponding to the current audio frame to be recognized.
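As a hedged PyTorch sketch of the architecture in fig. 6 (written for this description, not taken from the patent), the six hole convolution layers with the kernel sizes and channel counts listed above are followed by three LSTM layers and an output layer. The dilation rates, the paddings, the pooling that condenses the convolutional output before the LSTM, the LSTM hidden size and the number of output classes are not specified in the text and are assumptions of the sketch.

import torch
import torch.nn as nn

class HoleCNNLSTM(nn.Module):
    """Sketch of the target acoustic model: stacked hole (dilated) conv layers + LSTM layers."""
    def __init__(self, n_classes=1000, lstm_hidden=512):
        super().__init__()
        # (out_channels, (time_kernel, freq_kernel)) for hole convolution layers 0..5.
        specs = [(64, (7, 3)), (64, (5, 3)), (128, (3, 3)),
                 (128, (3, 3)), (256, (3, 3)), (256, (3, 3))]
        layers, in_ch = [], 1
        for i, (out_ch, k) in enumerate(specs):
            d = 2 ** min(i, 3)   # assumed dilation schedule: 1, 2, 4, 8, 8, 8
            pad = (d * (k[0] - 1) // 2, d * (k[1] - 1) // 2)
            layers += [nn.Conv2d(in_ch, out_ch, k, dilation=d, padding=pad), nn.ReLU()]
            in_ch = out_ch
        self.conv = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d((None, 1))      # assumed: collapse the feature axis
        self.lstm = nn.LSTM(input_size=256, hidden_size=lstm_hidden,
                            num_layers=3, batch_first=True)
        self.output = nn.Linear(lstm_hidden, n_classes)  # phonetic (acoustic) categories

    def forward(self, x):
        # x: (batch, time, feat_dim) -- one fused 240-dimensional feature per audio frame.
        feats = self.conv(x.unsqueeze(1))          # first output result, (batch, 256, time, feat_dim)
        feats = self.pool(feats).squeeze(-1)       # (batch, 256, time)
        out, _ = self.lstm(feats.transpose(1, 2))  # (batch, time, lstm_hidden)
        return self.output(out)                    # per-frame scores, (batch, time, n_classes)

model = HoleCNNLSTM()
print(model(torch.randn(2, 50, 240)).shape)   # torch.Size([2, 50, 1000])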
The phonetic category may be a multi-element phone (senone), or may be a phone (phone), or may also be a letter, a chinese character, or a word.
The above describes how the phonetic category of the acoustic features extracted from an audio frame is determined based on the acoustic model. In one possible implementation, a language model and decoding techniques may further be used to convert the phonetic category into text that the user can read and understand, which is not specifically limited by the embodiment of the present disclosure.
According to the method provided by the embodiment of the disclosure, in the voice recognition process, the FBank feature and the i-vector feature of the audio frame are extracted at the same time, then the FBank feature and the i-vector feature are subjected to feature fusion, and the fused features are input into an acoustic model as acoustic features for voice recognition. In addition, the acoustic model adopts the cavity convolution neural network, the calculated amount can be effectively reduced under the condition of the same receptive field by utilizing the cavity characteristics of the cavity convolution layer, and compared with CNN, the speed of voice recognition is accelerated.
In summary, the speech recognition method provided by the embodiment of the present disclosure has a better speech recognition effect.
FIG. 7 is a block diagram illustrating a speech recognition apparatus according to an example embodiment. Referring to fig. 7, the apparatus includes an acquisition unit 701, an extraction unit 702, a fusion unit 703, and a processing unit 704.
An acquisition unit 701 configured to acquire an audio frame to be recognized;
an extracting unit 702 configured to extract mel-scale filter bank features and a sounding user information vector of the audio frame, respectively;
a fusion unit 703 configured to perform fusion processing on the mel scale filter bank feature and the sounding user information vector to obtain a fusion feature;
a processing unit 704 configured to process the fusion features based on a target acoustic model, which includes a plurality of void convolution layers, to obtain a speech recognition result of the audio frame.
According to the device provided by the embodiment of the disclosure, in the voice recognition process, the Mel scale filter bank characteristics and the voice user information vector of the audio frame can be simultaneously extracted, then the characteristics of the two types are fused, the fused characteristics are input into an acoustic model as acoustic characteristics for voice recognition, and the fused characteristics can effectively express the characteristics of a speaker and the characteristics of a channel, so that the voice recognition mode improves the accuracy of the voice recognition; in addition, the acoustic model comprises a plurality of hole convolution layers, and the calculation amount can be effectively reduced under the same receptive field by utilizing the hole characteristics of the hole convolution layers, so that the speed of voice recognition is increased, namely, the voice recognition method provided by the embodiment of the disclosure can effectively improve the voice recognition effect.
In one possible implementation, referring to fig. 8, the fusion unit 703 includes:
a first processing subunit 7031, configured to perform normalization processing on the mel-scale filter bank features to obtain first intermediate features;
a second processing subunit 7032, configured to perform dimension transformation processing on the sound-generating user information vector to obtain a second intermediate feature, where a dimension of the second intermediate feature is greater than a dimension of the sound-generating user information vector;
a third processing subunit 7033, configured to perform normalization processing on the second intermediate feature to obtain a third intermediate feature;
a fusion subunit 7034, configured to perform a fusion process on the first intermediate feature and the third intermediate feature, so as to obtain the fused feature.
In one possible implementation, first processing subunit 7031 is further configured to normalize the mel-scale filter bank features to a mean of 0 and a variance of 1 based on a first BatchNorm layer, resulting in the first intermediate features;
a third processing subunit 7033, configured to normalize the second intermediate features to a mean of 0 and a variance of 1, based on a second BatchNorm layer, resulting in the third intermediate features.
In one possible implementation, the target acoustic model includes a hole convolutional neural network and an LSTM network, the hole convolutional neural network including the plurality of hole convolutional layers, the LSTM network including a plurality of LSTM layers;
a processing unit 704, further configured to input the fusion feature into the hole convolutional neural network, and process the fusion feature sequentially through the plurality of hole convolutional layers, where an output of a previous hole convolutional layer is an input of a next hole convolutional layer; taking a first output result of the last cavity convolution layer as the input of the LSTM network, and sequentially processing the first output result through the plurality of LSTM layers, wherein the output of the previous LSTM layer is the input of the next LSTM layer; determining the speech recognition result based on the second output result of the last LSTM layer.
In a possible implementation manner, the fusing subunit 7034 is further configured to perform column exchange processing on the first intermediate feature and the third intermediate feature to obtain the fused feature; or, performing weighted transformation processing on the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fusion feature.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 9 is a block diagram of a speech recognition apparatus provided in an embodiment of the present disclosure, where the apparatus 900 may be a server. The apparatus 900 may have a relatively large difference due to different configurations or performances, and may include one or more processors (CPUs) 901 and one or more memories 902, where the memory 902 stores at least one instruction, and the at least one instruction is loaded and executed by the processors 901 to implement the voice recognition method provided by the above-mentioned method embodiments. Of course, the apparatus may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input/output, and the apparatus may also include other components for implementing the functions of the device, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium, such as a memory, is also provided that includes instructions executable by a processor in a terminal to perform the speech recognition method in the above-described embodiments. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 10 shows a block diagram of an apparatus 1000 according to an exemplary embodiment of the disclosure. The apparatus 1000 may be a mobile terminal.
In general, the apparatus 1000 includes: a processor 1001 and a memory 1002.
Processor 1001 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 1001 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1001 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also referred to as a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1001 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 1001 may further include an AI (Artificial Intelligence) processor for processing a computing operation related to machine learning.
Memory 1002 may include one or more computer-readable storage media, which may be non-transitory. The memory 1002 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1002 is used to store at least one instruction for execution by processor 1001 to implement the speech recognition methods provided by method embodiments in the present disclosure.
In some embodiments, the apparatus 1000 may further include: a peripheral interface 1003 and at least one peripheral. The processor 1001, memory 1002 and peripheral interface 1003 may be connected by a bus or signal line. Various peripheral devices may be connected to peripheral interface 1003 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1004, touch screen display 1005, camera 1006, audio circuitry 1007, positioning components 1008, and power supply 1009.
The peripheral interface 1003 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 1001 and the memory 1002. In some embodiments, processor 1001, memory 1002, and peripheral interface 1003 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1001, the memory 1002, and the peripheral interface 1003 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The Radio Frequency circuit 1004 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 1004 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1004 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1004 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1004 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1004 may also include NFC (Near Field Communication) related circuits, which are not limited by this disclosure.
The display screen 1005 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1005 is a touch display screen, the display screen 1005 also has the ability to capture touch signals on or over the surface of the display screen 1005. The touch signal may be input to the processor 1001 as a control signal for processing. At this point, the display screen 1005 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display screen 1005 may be one, providing the front panel of the device 1000; in other embodiments, the display screen 1005 may be at least two, respectively disposed on different surfaces of the device 1000 or in a folded design; in still other embodiments, the display 1005 may be a flexible display, disposed on a curved surface or on a folded surface of the device 1000. Even more, the display screen 1005 may be arranged in a non-rectangular irregular figure, i.e., a shaped screen. The Display screen 1005 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and the like.
The camera assembly 1006 is used to capture images or video. Optionally, the camera assembly 1006 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be combined to realize a background blurring function, and the main camera and the wide-angle camera can be combined to realize panoramic shooting and VR (Virtual Reality) shooting functions or other combined shooting functions. In some embodiments, the camera assembly 1006 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuit 1007 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert them into electrical signals, and input the electrical signals to the processor 1001 for processing or to the radio frequency circuit 1004 for voice communication. For stereo acquisition or noise reduction purposes, there may be multiple microphones disposed at different locations of the device 1000. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 1001 or the radio frequency circuit 1004 into sound waves. The speaker may be a traditional diaphragm speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electrical signal into sound waves audible to humans, or into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 1007 may also include a headphone jack.
The positioning component 1008 is used to locate the current geographic position of the device 1000 for navigation or LBS (Location Based Service). The positioning component 1008 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 1009 is used to power the various components of the device 1000. The power supply 1009 may be an alternating current source, a direct current source, a disposable battery, or a rechargeable battery. When the power supply 1009 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is charged through a wired line, and a wireless rechargeable battery is charged through a wireless coil. The rechargeable battery may also support fast-charge technology.
In some embodiments, the device 1000 further comprises one or more sensors 1010. The one or more sensors 1010 include, but are not limited to: acceleration sensor 1011, gyro sensor 1012, pressure sensor 1013, fingerprint sensor 1014, optical sensor 1015, and proximity sensor 1016.
The acceleration sensor 1011 can detect the magnitude of acceleration on the three coordinate axes of a coordinate system established with respect to the device 1000. For example, the acceleration sensor 1011 may be used to detect the components of gravitational acceleration along the three coordinate axes. The processor 1001 may control the touch display screen 1005 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1011. The acceleration sensor 1011 may also be used to collect motion data for games or for the user.
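For illustration only, the gravity components could drive the landscape/portrait choice roughly as in the sketch below; the axis convention and comparison rule are assumptions, not details taken from the patent.

```python
# Hypothetical sketch: pick a display orientation from accelerometer gravity components.
def choose_view(gravity_x: float, gravity_y: float) -> str:
    """gravity_x, gravity_y: gravity components along the device's x and y axes."""
    # When gravity mostly aligns with the device's long (y) axis, the device is upright.
    return "portrait" if abs(gravity_y) >= abs(gravity_x) else "landscape"
```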
The gyro sensor 1012 may detect the body orientation and rotation angle of the device 1000, and the gyro sensor 1012 and the acceleration sensor 1011 may cooperate to capture the user's 3D motion of the device 1000. Based on the data collected by the gyro sensor 1012, the processor 1001 may implement the following functions: motion sensing (such as changing the UI according to the user's tilting operation), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 1013 may be disposed on a side bezel of the device 1000 and/or underneath the touch display screen 1005. When the pressure sensor 1013 is disposed on the side bezel of the device 1000, it can detect the user's holding signal on the device 1000, and the processor 1001 performs left-right hand recognition or shortcut operations according to the holding signal collected by the pressure sensor 1013. When the pressure sensor 1013 is disposed underneath the touch display screen 1005, the processor 1001 controls an operability control on the UI according to the user's pressure operation on the touch display screen 1005. The operability control includes at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 1014 is used to collect the user's fingerprint, and the processor 1001 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 1014, or the fingerprint sensor 1014 identifies the user's identity according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 1001 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 1014 may be disposed on the front, back, or side of the device 1000. When a physical key or vendor logo is provided on the device 1000, the fingerprint sensor 1014 may be integrated with the physical key or vendor logo.
The optical sensor 1015 is used to collect the ambient light intensity. In one embodiment, the processor 1001 may control the display brightness of the touch display screen 1005 according to the ambient light intensity collected by the optical sensor 1015. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 1005 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 1005 is decreased. In another embodiment, the processor 1001 may also dynamically adjust the shooting parameters of the camera assembly 1006 according to the ambient light intensity collected by the optical sensor 1015.
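A minimal sketch of this brightness control is given below; the lux thresholds, step size, and brightness range are assumed values for illustration, not figures from the patent.

```python
# Hypothetical sketch: map ambient light intensity to a new display brightness.
def adjust_brightness(ambient_lux: float, brightness: float, step: float = 0.1) -> float:
    """Return a display brightness in [0.1, 1.0] based on ambient light."""
    if ambient_lux > 500:                     # bright environment: raise brightness
        return min(1.0, brightness + step)
    if ambient_lux < 50:                      # dim environment: lower brightness
        return max(0.1, brightness - step)
    return brightness                         # otherwise keep the current setting
```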
The proximity sensor 1016, also known as a distance sensor, is typically provided on the front panel of the device 1000. The proximity sensor 1016 is used to capture the distance between the user and the front of the device 1000. In one embodiment, when the proximity sensor 1016 detects that the distance between the user and the front of the device 1000 gradually decreases, the processor 1001 controls the touch display screen 1005 to switch from the bright-screen state to the off-screen state; when the proximity sensor 1016 detects that the distance between the user and the front of the device 1000 gradually increases, the processor 1001 controls the touch display screen 1005 to switch from the off-screen state to the bright-screen state.
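The proximity-driven switching can be sketched as a simple threshold rule; the 5 cm threshold is an assumption that stands in for the gradual-distance condition described above.

```python
# Hypothetical sketch: toggle the screen state from the measured user distance.
def update_screen_state(distance_cm: float, screen_on: bool) -> bool:
    """Return the new screen state given the distance to the user."""
    if screen_on and distance_cm < 5.0:        # user approaching the front panel
        return False                           # switch to the off-screen state
    if not screen_on and distance_cm >= 5.0:   # user moving away again
        return True                            # switch back to the bright-screen state
    return screen_on
```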
Those skilled in the art will appreciate that the configuration shown in Fig. 10 does not limit the device 1000, which may include more or fewer components than those shown, combine certain components, or use a different arrangement of components.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (8)

1. A speech recognition method, comprising:
acquiring an audio frame to be recognized; respectively extracting Mel scale filter bank features and a sounding user information vector of the audio frame, wherein the sounding user information vector is obtained through a global difference space model, and the global difference space model is obtained by modeling sounding user differences and channel differences as a whole;
normalizing the Mel scale filter bank features to obtain a first intermediate feature; performing dimension transformation processing on the sounding user information vector to obtain a second intermediate feature, wherein the dimension of the second intermediate feature is larger than that of the sounding user information vector; normalizing the second intermediate feature to obtain a third intermediate feature; and performing fusion processing on the first intermediate feature and the third intermediate feature to obtain a fusion feature;
inputting the fusion feature into a dilated (cavity) convolutional neural network of a target acoustic model, and sequentially processing the fusion feature through a plurality of dilated convolution layers included in the dilated convolutional neural network, wherein the output of a previous dilated convolution layer is the input of a next dilated convolution layer; and
taking a first output result of the last dilated convolution layer as the input of a long short-term memory (LSTM) network of the target acoustic model, and sequentially processing the first output result through a plurality of LSTM layers included in the LSTM network, wherein the output of a previous LSTM layer is the input of a next LSTM layer; and determining a speech recognition result based on a second output result of the last LSTM layer.
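As an editorial illustration of the acoustic-model path in claim 1, the PyTorch sketch below passes the fusion feature through stacked dilated (cavity) convolution layers and then stacked LSTM layers, each layer feeding the next. The feature dimension, number of layers, dilation rates, hidden size, and output size are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class DilatedConvLSTMAcousticModel(nn.Module):
    """Sketch of a dilated-convolution + LSTM acoustic model over fused features."""

    def __init__(self, feat_dim=336, hidden_dim=512, num_outputs=4000):
        super().__init__()
        # Stacked dilated (cavity) 1-D convolution layers; the output of the
        # previous layer is the input of the next layer.
        self.dilated_convs = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, dilation=4, padding=4),
            nn.ReLU(),
        )
        # Stacked LSTM layers; the output of the previous layer is the input of the next.
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_outputs)

    def forward(self, fused_feats):          # fused_feats: (batch, time, feat_dim)
        x = fused_feats.transpose(1, 2)      # (batch, feat_dim, time) for Conv1d
        x = self.dilated_convs(x)            # "first output result" of the last dilated layer
        x = x.transpose(1, 2)                # back to (batch, time, hidden_dim)
        x, _ = self.lstm(x)                  # "second output result" from the last LSTM layer
        return self.classifier(x)            # per-frame scores used to determine the result

# Usage example with assumed shapes: 8 utterances, 200 frames, 336-dim fused features.
scores = DilatedConvLSTMAcousticModel()(torch.randn(8, 200, 336))
```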
2. The speech recognition method of claim 1, wherein normalizing the Mel scale filter bank features to obtain a first intermediate feature comprises:
normalizing the Mel scale filter bank features to a mean of 0 and a variance of 1 based on a first batch normalization (BatchNorm) layer to obtain the first intermediate feature;
and wherein normalizing the second intermediate feature to obtain a third intermediate feature comprises:
normalizing the second intermediate feature to a mean of 0 and a variance of 1 based on a second BatchNorm layer to obtain the third intermediate feature.
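The sketch below illustrates the normalization and dimension-transformation steps of claims 1 and 2: a first BatchNorm layer normalizes the Mel scale filter bank features, a linear layer expands the sounding user information vector to a higher dimension, and a second BatchNorm layer normalizes the result. The 80-dimensional filter bank, 100-dimensional vector, and 256-dimensional expansion are assumed values for illustration only.

```python
import torch
import torch.nn as nn

fbank = torch.randn(16, 80)     # Mel scale filter bank features, one row per frame in the batch
ivector = torch.randn(16, 100)  # sounding user information vectors (i-vectors)

first_batchnorm = nn.BatchNorm1d(80)    # first BatchNorm layer
dim_transform = nn.Linear(100, 256)     # dimension transformation: 100 -> 256 (larger than the input)
second_batchnorm = nn.BatchNorm1d(256)  # second BatchNorm layer

first_intermediate = first_batchnorm(fbank)         # per-dimension mean ~0, variance ~1
second_intermediate = dim_transform(ivector)        # dimension-expanded vector
third_intermediate = second_batchnorm(second_intermediate)
```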
3. The speech recognition method of claim 1, wherein performing fusion processing on the first intermediate feature and the third intermediate feature to obtain a fusion feature comprises:
performing column exchange processing on the first intermediate feature and the third intermediate feature to obtain the fusion feature; or
performing weighted transformation processing on the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fusion feature.
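The two fusion options of claim 3 can be sketched as follows; reading the column-wise operation as concatenation along the feature dimension is an interpretation rather than the claim's exact wording, and the dimensions carry over from the previous sketch.

```python
import torch
import torch.nn as nn

first_intermediate = torch.randn(16, 80)    # normalized Mel scale filter bank features
third_intermediate = torch.randn(16, 256)   # normalized, dimension-expanded sounding user information vector

# Option 1: combine the two features column-wise (interpreted here as concatenation
# along the feature dimension).
fusion_feature = torch.cat([first_intermediate, third_intermediate], dim=1)   # shape (16, 336)

# Option 2: apply a weight matrix to the combined features to obtain the fusion feature.
weight_matrix = nn.Linear(80 + 256, 256, bias=False)
fusion_feature_weighted = weight_matrix(fusion_feature)                       # shape (16, 256)
```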
4. A speech recognition apparatus, comprising:
an acquisition unit configured to acquire an audio frame to be recognized;
an extraction unit configured to respectively extract Mel scale filter bank features and a sounding user information vector of the audio frame, wherein the sounding user information vector is obtained through a global difference space model, and the global difference space model is obtained by modeling sounding user differences and channel differences as a whole;
a fusion unit configured to normalize the Mel scale filter bank features to obtain a first intermediate feature; perform dimension transformation processing on the sounding user information vector to obtain a second intermediate feature, wherein the dimension of the second intermediate feature is larger than that of the sounding user information vector; normalize the second intermediate feature to obtain a third intermediate feature; and perform fusion processing on the first intermediate feature and the third intermediate feature to obtain a fusion feature;
a processing unit configured to input the fusion feature into a dilated (cavity) convolutional neural network of a target acoustic model and sequentially process the fusion feature through a plurality of dilated convolution layers included in the dilated convolutional neural network, wherein the output of a previous dilated convolution layer is the input of a next dilated convolution layer; take a first output result of the last dilated convolution layer as the input of a long short-term memory (LSTM) network of the target acoustic model and sequentially process the first output result through a plurality of LSTM layers included in the LSTM network, wherein the output of a previous LSTM layer is the input of a next LSTM layer; and determine a speech recognition result based on a second output result of the last LSTM layer.
5. The speech recognition apparatus according to claim 4, wherein the fusion unit includes:
a first processing subunit configured to normalize the Mel scale filter bank features to a mean of 0 and a variance of 1 based on a first batch normalization (BatchNorm) layer, resulting in the first intermediate feature;
a third processing subunit configured to normalize the second intermediate feature to a mean of 0 and a variance of 1 based on a second BatchNorm layer, resulting in the third intermediate feature.
6. The speech recognition apparatus according to claim 4, wherein the fusion unit includes:
a fusion subunit configured to perform column exchange processing on the first intermediate feature and the third intermediate feature to obtain the fusion feature, or to perform weighted transformation processing on the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fusion feature.
7. A speech recognition apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the speech recognition method according to any one of claims 1 to 3.
8. A non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of a speech recognition apparatus, enable the speech recognition apparatus to perform the speech recognition method according to any one of claims 1 to 3.

Priority Applications (1)

Application Number: CN201910418620.XA | Priority Date: 2019-05-20 | Filing Date: 2019-05-20 | Title: Speech recognition method, apparatus and storage medium

Publications (2)

Publication Number Publication Date
CN110047468A CN110047468A (en) 2019-07-23
CN110047468B true CN110047468B (en) 2022-01-25

Family

ID=67282705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910418620.XA Active CN110047468B (en) 2019-05-20 2019-05-20 Speech recognition method, apparatus and storage medium

Country Status (1)

Country Link
CN (1) CN110047468B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110534101B (en) * 2019-08-27 2022-02-22 华中师范大学 Mobile equipment source identification method and system based on multimode fusion depth features
CN111739517B (en) * 2020-07-01 2024-01-30 腾讯科技(深圳)有限公司 Speech recognition method, device, computer equipment and medium
CN111816166A (en) * 2020-07-17 2020-10-23 字节跳动有限公司 Voice recognition method, apparatus, and computer-readable storage medium storing instructions
CN112133288A (en) * 2020-09-22 2020-12-25 中用科技有限公司 Method, system and equipment for processing voice to character
CN114446291A (en) * 2020-11-04 2022-05-06 阿里巴巴集团控股有限公司 Voice recognition method and device, intelligent sound box, household appliance, electronic equipment and medium
CN113327616A (en) * 2021-06-02 2021-08-31 广东电网有限责任公司 Voiceprint recognition method and device, electronic equipment and storage medium
CN114333782A (en) * 2022-01-13 2022-04-12 平安科技(深圳)有限公司 Voice recognition method, device, equipment and storage medium
CN115426582B (en) * 2022-11-06 2023-04-07 江苏米笛声学科技有限公司 Earphone audio processing method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1152399A1 (en) * 2000-05-04 2001-11-07 Faculte Polytechniquede Mons Subband speech processing with neural networks
CN101599271B (en) * 2009-07-07 2011-09-14 华中科技大学 Recognition method of digital music emotion
CN104732978B (en) * 2015-03-12 2018-05-08 上海交通大学 The relevant method for distinguishing speek person of text based on combined depth study

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10026397B2 (en) * 2013-12-10 2018-07-17 Google Llc Processing acoustic sequences using long short-term memory (LSTM) neural networks that include recurrent projection layers
CN107785015A (en) * 2016-08-26 2018-03-09 阿里巴巴集团控股有限公司 A kind of audio recognition method and device
CN107331384A (en) * 2017-06-12 2017-11-07 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN108417201A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Single-channel multi-speaker identification method and system
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model
CN109377984A (en) * 2018-11-22 2019-02-22 北京中科智加科技有限公司 A kind of audio recognition method and device based on ArcFace
CN109635936A (en) * 2018-12-29 2019-04-16 杭州国芯科技股份有限公司 A kind of neural networks pruning quantization method based on retraining

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Convolutional, long short-term memory, fully connected deep networks; Tara N. Sainath et al.; Speech and Signal Processing; 2015-08-06; pp. 4580-4584 *
Gated residual networks with dilated convolutions for supervised speech separation; Ke Tan et al.; Speech and Signal Processing; 2018-09-13; pp. 21-25 *
Short-utterance speaker recognition algorithm based on multi-feature i-vector (in Chinese); Sun Nian; Journal of Computer Applications; 2018-10-10; Vol. 38, No. 10; pp. 2839-2843 *
Current status, trends and prospects of deep learning applied to network security (in Chinese); Zhang Yuqing et al.; Journal of Computer Research and Development; 2018-12-31; Vol. 55, No. 6; pp. 1117-1135 *

Also Published As

Publication number Publication date
CN110047468A (en) 2019-07-23

Similar Documents

Publication Publication Date Title
CN110047468B (en) Speech recognition method, apparatus and storage medium
CN108615526B (en) Method, device, terminal and storage medium for detecting keywords in voice signal
WO2019105285A1 (en) Facial attribute recognition method, electronic device, and storage medium
US12106768B2 (en) Speech signal processing method and speech separation method
CN111696570B (en) Voice signal processing method, device, equipment and storage medium
EP3968223A1 (en) Method and apparatus for acquiring positions of target, and computer device and storage medium
CN111933112B (en) Awakening voice determination method, device, equipment and medium
CN111696532B (en) Speech recognition method, device, electronic equipment and storage medium
CN108538311A (en) Audio frequency classification method, device and computer readable storage medium
CN113763933B (en) Speech recognition method, training method, device and equipment of speech recognition model
CN112116908B (en) Wake-up audio determining method, device, equipment and storage medium
CN111613213B (en) Audio classification method, device, equipment and storage medium
CN111105788B (en) Sensitive word score detection method and device, electronic equipment and storage medium
US12112743B2 (en) Speech recognition method and apparatus with cascaded hidden layers and speech segments, computer device, and computer-readable storage medium
CN113160802B (en) Voice processing method, device, equipment and storage medium
CN112508959B (en) Video object segmentation method, device, electronic device and storage medium
CN112233689B (en) Audio noise reduction method, device, equipment and medium
CN111354378B (en) Voice endpoint detection method, device, equipment and computer storage medium
CN113920979A (en) Voice data acquisition method, device, equipment and computer readable storage medium
CN111125424B (en) Method, device, equipment and storage medium for extracting core lyrics of song
CN109829067B (en) Audio data processing method and device, electronic equipment and storage medium
CN115168643B (en) Audio processing method, device, equipment and computer readable storage medium
CN110288999B (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN114283827B (en) Audio dereverberation method, device, equipment and storage medium
CN110989963B (en) Wake-up word recommendation method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant