Detailed Description
It should be noted that the principles of the present application are illustrated as implemented in a suitable computing environment. The following description is based on illustrative embodiments of the application and should not be taken as limiting other embodiments not described in detail herein. All other embodiments that can be derived by those skilled in the art from the embodiments of the application without inventive effort fall within the scope of the application.
Relational terms such as "first" and "second" are used solely to distinguish one object or operation from another and do not necessarily imply any actual sequential relationship between the objects or operations. In the description of the embodiments of the present application, "plurality" means two or more, unless explicitly defined otherwise.
Artificial intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is thus the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of sensing, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include machine learning (ML) techniques, of which deep learning (DL) is a new research direction introduced to bring machine learning closer to its original goal, namely artificial intelligence. At present, deep learning is mainly applied in fields such as computer vision and natural language processing.
Deep learning learns the inherent regularities and representation hierarchies of sample data, and the information obtained during such learning greatly aids the interpretation of data such as text, images, and sound. Using deep learning techniques and corresponding training data sets, network models realizing different functions can be trained; for example, a deep learning network for gender classification can be trained on one training data set, a deep learning network for image optimization on another, and so on.
In order to improve the efficiency of pronunciation skill detection, the present application introduces deep learning into pronunciation skill detection and correspondingly provides a pronunciation skill detection method, a pronunciation skill detection device, a storage medium, and an electronic device. The pronunciation skill detection method may be performed by an electronic device.
Referring to fig. 1, the present application further provides a pronunciation skill detection system. As shown in fig. 1, the pronunciation skill detection system includes an electronic device 100. For example, the electronic device may obtain a text to be detected for pronunciation skill detection and convert the text to be detected into a corresponding phoneme sequence. When the electronic device is further configured with a microphone, audio collection may be performed while a speaker speaks the text to be detected, so as to obtain the audio to be detected, and acoustic features of the audio to be detected may be extracted. The obtained phoneme sequence and acoustic features are then input into a trained pronunciation skill detection model for pronunciation skill detection processing, so as to obtain a first detection result and a second detection result, where the first detection result indicates whether the speaker needs to speak the text to be detected using a pronunciation skill, and the second detection result indicates whether the speaker actually spoke the text to be detected using the pronunciation skill.
The electronic device 100 may be any device equipped with a processor and having processing capability, such as a mobile electronic device with a processor, e.g., a smart phone, tablet computer, palmtop computer, or notebook computer, or a stationary electronic device with a processor, e.g., a desktop computer, television, or server.
In addition, as shown in fig. 1, the pronunciation skill detection system may further include a storage device 200 for storing data including, but not limited to, raw data, intermediate data, and result data obtained during the pronunciation skill detection process. For example, the electronic device 100 may store in the storage device 200 the acquired text to be detected, the audio to be detected, the phoneme sequence converted from the text to be detected, the acoustic features extracted from the audio to be detected, and the first detection result and the second detection result output by the pronunciation skill detection model.
It should be noted that the schematic view of the pronunciation skill detection system shown in fig. 1 is only an example. The pronunciation skill detection system and the scenario described in the embodiments of the present application are intended to describe the technical solutions of the embodiments more clearly and do not constitute a limitation on those solutions. Those skilled in the art will appreciate that, with the evolution of pronunciation skill detection systems and the appearance of new service scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
Referring to fig. 2, fig. 2 is a flowchart of a pronunciation skill detection method according to an embodiment of the present application. As shown in fig. 2, the flow of the pronunciation skill detection method provided by the embodiment of the present application may be as follows:
in S310, a text to be detected is acquired, and the text to be detected is converted into a corresponding phoneme sequence.
Here, the text to be detected refers to the text on which pronunciation skill detection is performed. Pronunciation skill detection includes detecting whether the text to be detected needs to be spoken using a pronunciation skill, and detecting whether the speaker actually spoke the text to be detected using the pronunciation skill.
It should be noted that different languages such as Chinese and English have their own pronunciation skills; for example, English has pronunciation skills such as linking (continuous reading), loss of plosion, and voicing.
Linking (continuous reading) means that the tail phoneme of one word and the head phoneme of the next are naturally spelled together, with no pause in between.
Loss of plosion means that when two plosives (such as p, b, t, d, k, g) are adjacent, the former plosive only forms the mouth shape and articulatory obstruction at its place of articulation but is not exploded; after a slight pause, the latter consonant is pronounced. The former plosive is then referred to as a lost plosive, as in goo(d) bye.
Voicing means that when a voiceless consonant is preceded by /s/ and followed by a vowel, the voiceless consonant is read as its corresponding voiced consonant. Taking "speak" as an example, the voiceless consonant /p/ is preceded by the sound /s/ and followed by the vowel /iː/, and the voiced consonant corresponding to /p/ is /b/, so the original /spiːk/ is read as /sbiːk/.
As described above, the pronunciation skill detection method provided by the present application can detect the pronunciation skills of any language; correspondingly, the text to be detected may be text in any language, according to the actual detection needs. After obtaining the text to be detected, the electronic device further converts it into a corresponding phoneme sequence. For example, the electronic device may convert the acquired text to be detected into a corresponding phoneme sequence according to a pronunciation dictionary.
In an alternative embodiment, converting the text to be detected into a corresponding phoneme sequence includes:
Removing text units which are not pronounced in the text to be detected to obtain a new text to be detected;
and converting each text unit in the new text to be detected into a corresponding phoneme unit to obtain a phoneme sequence.
It will be appreciated that not all text units in a text need to be pronounced when the text is spoken; for example, punctuation in the text is not pronounced.
Therefore, in order to eliminate the interference of non-pronounced text units and improve the accuracy of pronunciation skill detection, when converting the text to be detected into a corresponding phoneme sequence, the electronic device first removes the non-pronounced text units (such as punctuation marks and emoticons) from the text to be detected to obtain a new text to be detected, and then converts each text unit in the new text to be detected into a corresponding phoneme unit according to a pronunciation dictionary.
For example, when the pronunciation skills of English are to be detected, the electronic device obtains the English text to be detected "Please turn on the light.". The text unit "." in the text to be detected is a punctuation mark and is a text unit without pronunciation; after removing this text unit, the new text to be detected "Please turn on the light" is obtained, and each text unit in the new text to be detected is then converted into a corresponding phoneme unit according to a pronunciation dictionary to obtain the phoneme sequence.
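The conversion described above can be sketched as follows. The mini dictionary and its ARPAbet-style phoneme symbols are illustrative assumptions, since the embodiment does not fix a particular pronunciation dictionary:

```python
import re

# Hypothetical mini pronunciation dictionary (ARPAbet-style symbols);
# a real system would use a full lexicon such as CMUdict.
PRON_DICT = {
    "please": ["P", "L", "IY", "Z"],
    "turn": ["T", "ER", "N"],
    "on": ["AA", "N"],
    "the": ["DH", "AH"],
    "light": ["L", "AY", "T"],
}

def text_to_phonemes(text):
    """Remove non-pronounced text units (punctuation, emoticons), then
    map each remaining text unit to its phoneme units."""
    cleaned = re.sub(r"[^\w\s']", "", text)  # drop punctuation marks
    phonemes = []
    for word in cleaned.lower().split():
        phonemes.extend(PRON_DICT[word])
    return phonemes

sequence = text_to_phonemes("Please turn on the light.")
```

The "." is stripped before dictionary lookup, so only pronounced text units contribute phoneme units to the sequence.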
In addition, to more clearly characterize the phoneme sequence, the electronic device may also add a start flag before the phoneme sequence and an end flag after it, the start of the phoneme sequence being characterized by the start flag and the end by the end flag. The specific configuration of the start flag and the end flag is not limited here and may be configured by those skilled in the art according to actual needs.
For example, the start flag may be configured as "&lt;bos&gt;" and the end flag as "&lt;eos&gt;"; after the electronic device adds the start flag and the end flag to the above phoneme sequence, the sequence begins with "&lt;bos&gt;" and ends with "&lt;eos&gt;".
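As a minimal sketch of this step, using the "&lt;bos&gt;"/"&lt;eos&gt;" flag configuration given above:

```python
def add_sequence_flags(phonemes, start="<bos>", end="<eos>"):
    # Characterize the start and end of the phoneme sequence with the
    # configured start flag and end flag.
    return [start] + list(phonemes) + [end]

flagged = add_sequence_flags(["P", "L", "IY", "Z"])
```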
In S320, the audio to be detected, produced by the speaker speaking the text to be detected, is acquired, and acoustic features of the audio to be detected are extracted.
In this embodiment, in addition to converting the text to be detected into the corresponding phoneme sequence, the electronic device obtains the audio to be detected produced by the speaker speaking the text to be detected. The data format of the audio to be detected is not limited here and may be configured by those skilled in the art according to actual detection needs.
The speaker may be a real person or a virtual person.
For example, when the speaker is a real person, the electronic device may collect, through a configured audio collection device (either internal or external), the voice of the real person speaking the text to be detected, and take the collected audio as the audio to be detected. In addition, the electronic device may also obtain from other electronic devices the audio to be detected that those devices collected of the real person speaking the text to be detected. Correspondingly, using the audio to be detected obtained in this way, the electronic device can apply the pronunciation skill detection method provided by the application to detect the pronunciation ability of the real person.
For another example, when the speaker is a virtual person, for example one based on artificial intelligence, the electronic device may directly input the text to be detected into speech synthesis software, have the speech synthesis software perform speech synthesis and output the synthesized audio, and use that audio as the audio to be detected. Correspondingly, using the audio to be detected obtained in this way, the electronic device can detect the speech synthesis capability of the speech synthesis software.
As described above, after obtaining the audio to be detected produced by the speaker speaking the text to be detected, the electronic device further extracts the acoustic features of the audio to be detected. Acoustic features are physical quantities representing the acoustic characteristics of speech, such as the energy concentration areas, formant frequencies, formant intensities, and bandwidths that represent timbre, and the duration, fundamental frequency, and average speech power that represent the prosodic characteristics of speech.
In an alternative embodiment, to further improve accuracy of pronunciation skill detection, extracting acoustic features of the audio to be detected includes:
Extracting Filterbank features, fundamental frequency features and energy features of the audio to be detected;
and fusing the Filterbank characteristic, the fundamental frequency characteristic and the energy characteristic to obtain the acoustic characteristic.
In this embodiment, the Filterbank feature, the fundamental frequency feature and the energy feature are used as acoustic features related to the pronunciation skill, and accordingly, when the acoustic features for detecting the pronunciation skill are extracted, the electronic device extracts the Filterbank feature, the fundamental frequency feature and the energy feature of the audio to be detected. The dimensions of the extracted Filterbank features are not particularly limited herein, and may be configured by those skilled in the art according to actual needs, for example, in this embodiment, the electronic device may extract Filterbank features of 40 dimensions of the audio to be detected.
As above, after extracting the Filterbank feature, the fundamental frequency feature and the energy feature of the audio to be detected, the electronic device further fuses the Filterbank feature, the fundamental frequency feature and the energy feature according to the configured fusion strategy to obtain a fusion feature, and the fusion feature is used as an acoustic feature for detecting pronunciation skills. The configuration of the fusion policy is not particularly limited herein, and may be configured by those skilled in the art according to actual needs.
For example, referring to fig. 3, the fusion strategy configured in this embodiment is to splice the Filterbank feature, the fundamental frequency feature and the energy feature according to the time dimension, so as to obtain the acoustic feature for detecting the pronunciation skill.
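The splicing step can be sketched as follows. The per-frame values are toy stand-ins, since computing real 40-dimensional Filterbank features, fundamental frequency, and energy would require a speech-processing library that the embodiment does not name; splicing is interpreted here as time-aligned, frame-by-frame concatenation:

```python
def fuse_acoustic_features(filterbank, f0, energy):
    """Splice the three features frame by frame (time-aligned), appending
    the fundamental frequency and energy values to each Filterbank frame."""
    assert len(filterbank) == len(f0) == len(energy)
    return [frame + [p, e] for frame, p, e in zip(filterbank, f0, energy)]

# Toy example: 3 frames of 40-dim Filterbank features plus per-frame
# fundamental frequency (Hz) and energy values.
fbank = [[0.0] * 40 for _ in range(3)]
pitch = [120.0, 125.0, 118.0]
energy = [0.8, 0.9, 0.7]

acoustic = fuse_acoustic_features(fbank, pitch, energy)
# Each fused frame now has 40 + 1 + 1 = 42 dimensions.
```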
In S330, the phoneme sequence and the acoustic feature are input into a trained pronunciation skill detection model to perform a pronunciation skill detection process, so as to obtain a first detection result and a second detection result.
It should be noted that, in the present application, a corresponding pronunciation skill detection model is pre-trained for each language; for example, a pronunciation skill detection model for detecting Chinese pronunciation skills is pre-trained for Chinese, and a pronunciation skill detection model for detecting English pronunciation skills is pre-trained for English. The structure and training mode of the pronunciation skill detection model are not particularly limited and can be selected by those skilled in the art according to actual needs.
The pronunciation skill detection model is configured to take as inputs the acoustic features of the audio to be detected, produced by the speaker speaking the text to be detected, and the phoneme sequence derived from the text to be detected, and to correspondingly output a detection result representing whether the speaker needs to speak the text to be detected using a pronunciation skill and a detection result representing whether the speaker actually spoke the text to be detected using the pronunciation skill.
Correspondingly, in this embodiment, after the above phoneme sequence and acoustic features are obtained, the electronic device inputs them into a trained pronunciation skill detection model matched with the language of the text to be detected for pronunciation skill detection processing, and obtains the first detection result and the second detection result output by the model. The first detection result characterizes whether the speaker needs to speak the text to be detected using a pronunciation skill, and the second detection result characterizes whether the speaker actually spoke the text to be detected using the pronunciation skill.
Taking the above text to be detected "Please turn on the light" as an example, according to expert knowledge, the phonemes "n" and "a" form a linking collocation, so "turn" and "on" need to be read continuously. For the phoneme sequence and acoustic features corresponding to this text to be detected, the first detection result output by the pronunciation skill detection model will characterize that the text to be detected needs to be spoken using the linking skill, and, depending on whether the speaker actually used linking when speaking the text to be detected, the pronunciation skill detection model will output a corresponding second detection result.
In addition, the phoneme sequence may be input as the original phoneme sequence, or it may first be digitally encoded and converted into a phoneme sequence in numeric form, i.e., one in which each phoneme is represented by a number. Correspondingly, if the phoneme sequence is input in numeric form, the pronunciation skill detection model must also be trained on phoneme sequence samples in numeric form.
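Digital encoding of the phoneme sequence can be sketched as below; the phoneme inventory and id assignment are illustrative assumptions rather than a vocabulary fixed by the embodiment:

```python
# Hypothetical phoneme vocabulary; each symbol is assigned a numeric id.
PHONEME_VOCAB = ["<bos>", "<eos>", "P", "L", "IY", "Z", "T", "ER", "N",
                 "AA", "DH", "AH", "AY"]
PHONEME_TO_ID = {p: i for i, p in enumerate(PHONEME_VOCAB)}

def encode_phonemes(phonemes):
    # Convert each phoneme symbol to its number so the model consumes the
    # sequence in digital form (training samples must use the same form).
    return [PHONEME_TO_ID[p] for p in phonemes]

ids = encode_phonemes(["<bos>", "P", "L", "IY", "Z", "<eos>"])
```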
In an alternative embodiment, the pronunciation skill detection model includes a phoneme feature extraction network, an acoustic feature enhancement network, a feature fusion network, a first pronunciation skill detection network, and a second pronunciation skill detection network, and inputting the phoneme sequence and the acoustic features into the trained pronunciation skill detection model for pronunciation skill detection processing to obtain the first detection result and the second detection result includes:
Inputting the phoneme sequence into a phoneme feature extraction network to perform feature extraction processing to obtain a phoneme feature matrix;
Inputting the phoneme feature matrix into a first pronunciation skill detection network to carry out pronunciation skill detection processing to obtain a first detection result;
if the first detection result characterizes that the text to be detected needs to be spoken using a pronunciation skill, inputting the acoustic features into an acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix;
inputting the enhanced acoustic feature matrix and the phoneme feature matrix into a feature fusion network to perform feature fusion processing to obtain a fusion feature matrix;
and inputting the fusion feature matrix into a second pronunciation skill detection network to carry out pronunciation skill detection processing to obtain a second detection result.
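The branch structure above can be sketched as the following control flow, where the five networks are stand-in callables (simple stubs for illustration) rather than trained models:

```python
def detect_pronunciation_skill(phoneme_seq, acoustic_feat, model):
    """Two-stage detection: the first result from the phoneme branch
    gates whether the acoustic branch runs at all."""
    phoneme_matrix = model["phoneme_extractor"](phoneme_seq)
    first_result = model["first_detector"](phoneme_matrix)
    if not first_result:
        # No pronunciation skill needed: skip the acoustic branch.
        return first_result, None
    enhanced = model["acoustic_enhancer"](acoustic_feat)
    fused = model["fusion"](enhanced, phoneme_matrix)
    second_result = model["second_detector"](fused)
    return first_result, second_result

# Stub model: skill is "needed" iff the phoneme sequence contains "N",
# and "used" iff the mean acoustic value exceeds a threshold.
stub = {
    "phoneme_extractor": lambda seq: seq,
    "first_detector": lambda m: "N" in m,
    "acoustic_enhancer": lambda feat: feat,
    "fusion": lambda acoustic, phonemes: acoustic,
    "second_detector": lambda f: sum(f) / len(f) > 0.5,
}

results = detect_pronunciation_skill(["T", "ER", "N"], [0.9, 0.8], stub)
```

The gating mirrors the embodiment in which the acoustic feature enhancement, feature fusion, and second detection only run when the first detection result indicates a pronunciation skill is required.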
Referring to fig. 4, the pronunciation skill detection model provided in the present embodiment is composed of 5 major parts, which are a phoneme feature extraction network, an acoustic feature enhancement network, a feature fusion network, a first pronunciation skill detection network and a second pronunciation skill detection network.
The phoneme feature extraction network is configured to perform feature extraction on an input phoneme sequence to obtain a phoneme feature matrix reflecting the inter-phoneme relationship in the phoneme sequence.
The acoustic feature enhancement network is configured to perform feature enhancement processing on the input acoustic features to enhance features therein that are more relevant to pronunciation skills, resulting in an enhanced acoustic feature matrix.
The feature fusion network is configured to perform feature fusion on the input phoneme feature matrix and the enhanced acoustic feature matrix, allowing the phoneme information and the acoustic feature information to interact, so as to obtain a fusion feature matrix.
The first pronunciation skill detection network is configured to perform pronunciation skill detection processing on the input phoneme feature matrix and output a first detection result for representing whether the text to be detected needs to be uttered by using pronunciation skill.
The second pronunciation skill detection network is configured to perform pronunciation skill detection processing on the input fusion feature matrix, and output a second detection result for representing whether the speaker speaks the text to be detected by using pronunciation skill.
Accordingly, in this embodiment, when inputting the phoneme sequence and the acoustic features into the trained pronunciation skill detection model for pronunciation skill detection processing, the electronic device may input the phoneme sequence into the phoneme feature extraction network for feature extraction processing to obtain a phoneme feature matrix, and then input the phoneme feature matrix into the first pronunciation skill detection network for pronunciation skill detection processing to obtain the first detection result.
Meanwhile, the electronic device inputs the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix, inputs the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network for feature fusion processing to obtain a fusion feature matrix, and inputs the fusion feature matrix into the second pronunciation skill detection network for pronunciation skill detection processing to obtain the second detection result.
In addition, the electronic device may determine, according to the first detection result, whether to output the second detection result. After obtaining the first detection result and the second detection result from the pronunciation skill detection processing of the pronunciation skill detection model, the electronic device determines from the first detection result whether the text to be detected needs to be spoken using a pronunciation skill. If so, the electronic device outputs both detection results: the first detection result indicates that the text to be detected needs to be spoken using the pronunciation skill, and the second detection result indicates whether the speaker actually used the pronunciation skill. If not, the electronic device may discard the second detection result and output only the first detection result.
In other embodiments, if the first detection result indicates that the text to be detected needs to be spoken using a pronunciation skill, the electronic device inputs the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix. The electronic device then inputs the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network for feature fusion processing to obtain a fusion feature matrix. Finally, the electronic device inputs the fusion feature matrix into the second pronunciation skill detection network for pronunciation skill detection processing to obtain the second detection result.
In addition, if the first detection result indicates that the text to be detected does not need to be spoken by adopting pronunciation skills, further pronunciation skill detection is not needed, and at this time, the electronic equipment does not use acoustic features to perform pronunciation skill detection any more, and only the first detection result can be output.
In an alternative embodiment, the phoneme feature extraction network includes a phoneme embedding module and a phoneme feature extraction module, and inputting the phoneme sequence into the phoneme feature extraction network to perform feature extraction processing to obtain a phoneme feature matrix, including:
Inputting the phoneme sequence into a phoneme embedding module for embedding processing to obtain a phoneme vector matrix;
Inputting the phoneme vector matrix into a phoneme feature extraction module for feature extraction processing to obtain a phoneme feature matrix.
Referring to fig. 5, in this embodiment the phoneme feature extraction network is composed of two parts, namely a phoneme embedding module and a phoneme feature extraction module. The phoneme embedding module is configured to perform embedding processing on the input phoneme sequence, vectorizing it to obtain a phoneme vector matrix, and the phoneme feature extraction module is configured to perform feature extraction on the input phoneme vector matrix to obtain a phoneme feature matrix reflecting the inter-phoneme relationships in the phoneme sequence.
Correspondingly, in this embodiment, when inputting a phoneme sequence into a phoneme feature extraction network to perform feature extraction processing, the electronic device first inputs the phoneme sequence into a phoneme embedding module to perform embedding processing to obtain a phoneme vector matrix, and then inputs the phoneme vector matrix into a phoneme feature extraction module to perform feature extraction processing to obtain a phoneme feature matrix.
The phoneme feature extraction module includes at least one phoneme feature extraction submodule, and inputting the phoneme vector matrix into the phoneme feature extraction module for feature extraction processing to obtain a phoneme feature matrix includes:
when the number of the phoneme feature extraction submodules is 1, inputting the phoneme vector matrix into the phoneme feature extraction submodule to perform feature extraction processing to obtain a phoneme feature matrix, or
When the number of the phoneme feature extraction submodules is N, inputting the phoneme vector matrix into the N phoneme feature extraction submodules to sequentially perform feature extraction processing to obtain a phoneme feature matrix, wherein N is an integer larger than 1. The value of N is not particularly limited, and may be configured by those skilled in the art according to actual needs, for example, N may be configured to be 2.
In this embodiment, the phoneme feature extraction module may consist of one phoneme feature extraction submodule, or of N phoneme feature extraction submodules connected in sequence. When the phoneme feature extraction module consists of N phoneme feature extraction submodules, each submodule performs the same feature extraction processing. The feature extraction process of one phoneme feature extraction submodule is described below as an example.
Referring to fig. 6, the phoneme feature extraction submodule is composed of 3 sublayers, which are a first matrix conversion layer, a first multi-head attention layer, and a first matrix fusion layer, respectively, wherein,
The first matrix conversion layer is configured to perform matrix conversion processing on an input matrix, and convert the input matrix into a query matrix, a key matrix and a value matrix respectively;
the first multi-head attention layer is configured to perform attention enhancement processing on the input query matrix, the key matrix and the value matrix to obtain an attention enhancement matrix;
the first matrix fusion layer is configured to perform matrix fusion processing on an input matrix of the first matrix conversion layer and an output matrix of the first multi-head attention layer to obtain a fusion matrix.
Correspondingly, when the number of the phoneme feature extraction submodules is 1, the electronic device can extract and obtain a phoneme feature matrix according to the following mode:
Inputting the phoneme vector matrix into a first matrix conversion layer for matrix conversion processing to obtain a query matrix, a key matrix and a value matrix, and respectively marking the query matrix, the key matrix and the value matrix as a first query matrix, a first key matrix and a first value matrix;
Inputting the first query matrix, the first key matrix and the first value matrix into a first multi-head attention layer for attention enhancement processing to obtain an attention enhancement matrix, and marking the attention enhancement matrix as the first attention enhancement matrix;
and inputting the first attention enhancement matrix and the phoneme vector matrix into a first matrix fusion layer for matrix fusion processing to obtain a fusion matrix, and recording the fusion matrix as a phoneme characteristic matrix.
It can be understood that when the number of phoneme feature extraction submodules is N, the phoneme vector matrix only needs to be input into the 1st of the N sequentially connected phoneme feature extraction submodules; the N submodules then perform feature extraction processing in sequence in the above manner, and the fusion matrix output by the N-th phoneme feature extraction submodule is taken as the phoneme feature matrix.
In this embodiment, the first matrix fusion layer is not particularly limited, and may be configured by those skilled in the art according to actual needs.
For example, the first matrix fusion layer may include two sub-layers, namely an addition layer and a layer normalization layer, and when the matrices are fused, the addition layer adds the two input matrices to obtain a sum matrix, and then the layer normalization layer performs layer normalization on the sum matrix to obtain a fusion matrix.
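Under the assumption that the submodule follows a standard Transformer-style encoder block, its data flow can be sketched with a single attention head (a simplification of the multi-head layer) in plain Python:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def submodule(X, Wq, Wk, Wv, eps=1e-5):
    """One phoneme feature extraction submodule: matrix conversion to
    query/key/value, attention enhancement, then residual addition
    followed by layer normalization (the first matrix fusion layer)."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Q[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d)
               for k_row in K] for q_row in Q]
    enhanced = matmul([softmax(row) for row in scores], V)
    out = []
    for x_row, e_row in zip(X, enhanced):
        s = [x + e for x, e in zip(x_row, e_row)]          # addition layer
        mu = sum(s) / len(s)
        var = sum((v - mu) ** 2 for v in s) / len(s)
        out.append([(v - mu) / math.sqrt(var + eps) for v in s])  # layer norm
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 2.0], [3.0, 0.0]]
Y = submodule(X, identity, identity, identity)
```

With identity conversion weights the example degenerates to self-attention over the raw phoneme vectors; after the addition and layer normalization, each output row has zero mean.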
In an alternative embodiment, referring to fig. 7, the phoneme feature extracting network further includes a first position encoding module and a second matrix fusion layer, and before inputting the phoneme vector matrix into the phoneme feature extracting module to perform feature extraction processing, the method further includes:
Inputting the phoneme vector matrix into a first position coding module for position coding processing to obtain a first position coding matrix;
Inputting the first position coding matrix and the phoneme vector matrix into a second matrix fusion layer for matrix fusion processing to obtain a phoneme position fusion matrix;
inputting the phoneme vector matrix into a phoneme feature extraction module for feature extraction processing to obtain a phoneme feature matrix includes the following step:
Inputting the phoneme position fusion matrix into a phoneme feature extraction module for feature extraction processing to obtain a phoneme feature matrix.
In order to further improve the accuracy of pronunciation skill detection, in this embodiment, the original phoneme vector matrix is not input into the phoneme feature extraction module for feature extraction; instead, after position encoding, the phoneme vector matrix carrying the position information is input into the phoneme feature extraction module for feature extraction.
The electronic device inputs the phoneme vector matrix into a first position coding module to perform position coding processing to obtain a position coding matrix, and records the position coding matrix as a first position coding matrix, wherein the first position coding matrix characterizes the position information of each matrix unit in the phoneme vector matrix, and can be relative position information or absolute position information.
After the first position coding matrix is obtained, the electronic device inputs the first position coding matrix and the phoneme vector matrix into a second matrix fusion layer to perform matrix fusion processing, so as to obtain a fusion matrix, and the fusion matrix is recorded as a phoneme position fusion matrix. The electronic device then further inputs the phoneme position fusion matrix carrying the position information into the phoneme feature extraction module to perform feature extraction processing to obtain a phoneme feature matrix. For how the phoneme feature extraction module performs feature extraction, please refer to the related description of the above embodiments, which is not repeated here.
In addition, it should be noted that, in this embodiment, the matrix fusion manner of the second matrix fusion layer is not particularly limited, and may be configured by those skilled in the art according to actual needs. For example, the second matrix fusion layer is configured to perform addition processing on two matrices inputted, and output the sum matrix obtained by addition as a fusion matrix.
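The position coding and fusion steps above can be sketched as follows. Sinusoidal absolute position encoding is used here purely as an illustration (the embodiment allows either relative or absolute position information and does not fix a scheme), and the second matrix fusion layer is configured as addition, as in the example above.

```python
import numpy as np

def positional_encoding(T, d):
    # Sinusoidal absolute position encoding: one common choice,
    # shown as an illustrative assumption.
    pos = np.arange(T)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

T, d = 6, 8
X = np.random.default_rng(1).normal(size=(T, d))  # phoneme vector matrix
P = positional_encoding(T, d)                     # first position coding matrix
fused = X + P   # second matrix fusion layer configured as addition
print(fused.shape)
```

The phoneme position fusion matrix `fused` then replaces the raw phoneme vector matrix as the input to the phoneme feature extraction module.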
In an alternative embodiment, the acoustic feature enhancement network includes a feature encoding module and at least 1 acoustic feature enhancement module, and inputting the acoustic features into the acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix includes:
inputting the acoustic features into a feature encoding module for feature encoding processing to obtain an acoustic feature matrix;
When the number of the acoustic feature enhancement modules is 1, inputting the acoustic feature matrix into the acoustic feature enhancement module for feature enhancement processing to obtain an enhanced acoustic feature matrix; or
When the number of the acoustic feature enhancement modules is M, inputting the acoustic feature matrix into the M acoustic feature enhancement modules to sequentially perform feature enhancement processing, so as to obtain an enhanced acoustic feature matrix, wherein M is an integer larger than 1. The value of M is not particularly limited, and may be configured by those skilled in the art according to actual needs, for example, M may be configured to be 4.
It should be noted that, in the above embodiment, the obtained Filterbank features, fundamental frequency features and energy features are presented in the form of feature maps, and correspondingly, the acoustic features obtained by fusing the Filterbank features, the fundamental frequency features and the energy features are also presented in the form of feature maps.
In order to effectively perform feature enhancement processing on acoustic features, in this embodiment, the acoustic feature enhancement network is composed of 1 feature encoding module and at least 1 acoustic feature enhancement module, wherein the feature encoding module is configured to encode acoustic features and compress feature dimensions to obtain corresponding acoustic feature matrices, and the acoustic feature enhancement module is configured to perform feature enhancement processing on input acoustic feature matrices to enhance features related to pronunciation skills therein and obtain enhanced acoustic feature matrices.
Correspondingly, when the acoustic features are input into the acoustic feature enhancement network to perform feature enhancement processing, the electronic equipment firstly inputs the acoustic features into the feature coding module to perform feature coding processing to obtain an acoustic feature matrix.
Referring to fig. 8, the feature encoding module is composed of 4 sub-layers including a first convolution layer, a first pooling layer, a second convolution layer and a second pooling layer, wherein,
The first convolution layer is configured to carry out convolution processing on the input feature map to obtain a corresponding convolution result;
the first pooling layer is configured to pool the convolution result output by the first convolution layer to obtain a corresponding pooling result;
the second convolution layer is configured to carry out convolution processing on the pooling result output by the first pooling layer to obtain a corresponding convolution result;
the second pooling layer is configured to pool the convolution result output by the second convolution layer to obtain a feature matrix of the corresponding feature map.
Accordingly, the electronic device may input the acoustic features into the feature encoding module to perform feature encoding processing as follows:
Inputting the acoustic features into a first convolution layer for convolution processing to obtain a convolution result, and recording the convolution result as a first convolution result;
Inputting the first convolution result into a first pooling layer for pooling processing to obtain a pooling result, and marking the pooling result as a first pooling result;
inputting the first pooling result into a second convolution layer for convolution processing to obtain a convolution result, and recording the convolution result as a second convolution result;
And inputting the second convolution result into a second pooling layer for pooling processing to obtain an acoustic feature matrix.
It should be noted that, in this embodiment, the convolution kernel sizes, step sizes, and padding sizes of the first convolution layer and the second convolution layer are not specifically limited, and may be configured by those skilled in the art according to actual needs.
For example, in this embodiment, the convolution kernel size of the first convolution layer is [3,3], the step size is [1,1], the padding size is [1,1], the convolution kernel size of the second convolution layer is [3,3], the step size is [1,1], the padding size is [1,1], the pooling type of the first pooling layer is configured as maximum pooling, the pooling kernel size is [2,2], the step size is [1,1], the pooling type of the second pooling layer is configured as maximum pooling, the pooling kernel size is [2,2], and the step size is [1,1].
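With the example configuration above, the spatial sizes through the feature encoding module follow the standard output-size formula. The sketch below traces a hypothetical 80x80 input feature map through the four sub-layers (the actual input size depends on the audio and is not fixed by the embodiment).

```python
def out_size(n, k, s, p=0):
    # Standard convolution/pooling output-size formula:
    # floor((n + 2*p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

# 3x3 convolution, stride 1, padding 1 preserves the size;
# 2x2 max pooling with stride 1 shrinks each dimension by 1.
h = 80                           # hypothetical feature-map side length
h = out_size(h, k=3, s=1, p=1)   # first convolution layer  -> 80
h = out_size(h, k=2, s=1)        # first pooling layer      -> 79
h = out_size(h, k=3, s=1, p=1)   # second convolution layer -> 79
h = out_size(h, k=2, s=1)        # second pooling layer     -> 78
print(h)  # 78
```

So this particular configuration compresses only mildly; stronger compression would come from larger strides, which the embodiment leaves to the practitioner.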
Further, when the number of the acoustic feature enhancement modules is 1, the electronic device directly inputs the acoustic feature matrix into the acoustic feature enhancement modules to perform feature enhancement processing, so as to obtain an enhanced acoustic feature matrix.
The feature enhancement processing procedure of 1 acoustic feature enhancement module is described below as an example.
Referring to fig. 9, the acoustic feature enhancement module is composed of 6 sub-layers, namely a second matrix conversion layer, a second multi-head attention layer, a third matrix fusion layer, a third convolution layer, a deconvolution layer, and a fourth matrix fusion layer, wherein,
The second matrix conversion layer is configured to perform matrix conversion processing on the input matrix, and convert the input matrix into a query matrix, a key matrix and a value matrix respectively;
the second multi-head attention layer is configured to perform attention enhancement processing on the input query matrix, the key matrix and the value matrix to obtain an attention enhancement matrix;
The third matrix fusion layer is configured to perform matrix fusion processing on the input matrix of the second matrix conversion layer and the output matrix of the second multi-head attention layer to obtain a fusion matrix;
The third convolution layer is configured to carry out convolution processing on the input fusion matrix to obtain a convolution result;
the deconvolution layer is configured to carry out deconvolution processing on the input convolution result to obtain a matrix-form deconvolution result;
The fourth matrix fusion layer is configured to perform matrix fusion processing on the fusion matrix output by the third matrix fusion layer and the deconvolution result output by the deconvolution layer to obtain a fusion matrix.
Accordingly, when the number of the acoustic feature enhancement modules is 1, the electronic device may obtain the enhanced acoustic feature matrix as follows:
inputting the acoustic feature matrix into a second matrix conversion layer for matrix conversion processing to obtain a query matrix, a key matrix and a value matrix, and respectively marking the query matrix, the key matrix and the value matrix as a second query matrix, a second key matrix and a second value matrix;
inputting the second query matrix, the second key matrix and the second value matrix into a second multi-head attention layer for attention enhancement processing to obtain an attention enhancement matrix, and marking the attention enhancement matrix as a second attention enhancement matrix;
inputting the second attention enhancement matrix and the acoustic feature matrix into a third matrix fusion layer for matrix fusion processing to obtain a fusion matrix, and marking the fusion matrix as an acoustic fusion matrix;
Inputting the acoustic fusion matrix into a third convolution layer for convolution processing to obtain a third convolution result;
inputting the third convolution result into the deconvolution layer for deconvolution processing to obtain a deconvolution result in matrix form;
and inputting the acoustic fusion matrix and the deconvolution result into a fourth matrix fusion layer for matrix fusion processing to obtain a fusion matrix, and marking the fusion matrix as an enhanced acoustic feature matrix.
The matrix fusion manner of the third matrix fusion layer and the fourth matrix fusion layer is not particularly limited in this embodiment, and may be configured by those skilled in the art according to actual needs.
For example, the third matrix fusion layer and the fourth matrix fusion layer have the same structure and respectively comprise two sub-layers, namely an addition layer and a layer normalization layer, when the matrixes are fused, the addition layer is used for adding the two input matrixes to obtain a sum matrix, and then the layer normalization layer is used for carrying out layer normalization on the sum matrix to obtain the fusion matrix.
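The convolution and deconvolution steps above reduce and then restore the spatial resolution of the matrix, so that the deconvolution result can be fused with the acoustic fusion matrix. The numpy sketch below shows only this shape round-trip with a hypothetical 2x2 kernel and stride 2 (the embodiment leaves kernel size, stride and padding to the practitioner).

```python
import numpy as np

def conv2d(x, w, s=2):
    # Plain strided convolution (no padding); downsamples the map.
    k = w.shape[0]
    H = (x.shape[0] - k) // s + 1
    W = (x.shape[1] - k) // s + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(x[i*s:i*s+k, j*s:j*s+k] * w)
    return out

def deconv2d(y, w, s=2):
    # Transposed convolution: scatter-add each input element times the
    # kernel; upsamples back toward the original resolution.
    k = w.shape[0]
    H = (y.shape[0] - 1) * s + k
    W = (y.shape[1] - 1) * s + k
    out = np.zeros((H, W))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            out[i*s:i*s+k, j*s:j*s+k] += y[i, j] * w
    return out

rng = np.random.default_rng(2)
x = rng.normal(size=(8, 8))   # acoustic fusion matrix (illustrative size)
w = rng.normal(size=(2, 2))   # hypothetical 2x2 kernel, stride 2
y = conv2d(x, w, s=2)         # third convolution layer -> (4, 4)
z = deconv2d(y, w, s=2)       # deconvolution layer     -> (8, 8)
print(y.shape, z.shape)
```

Because `z` matches the shape of `x`, the fourth matrix fusion layer can add it to the acoustic fusion matrix and apply layer normalization as described.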
It should be noted that, when the number of the acoustic feature enhancement modules is M, each acoustic feature enhancement module performs the same feature enhancement processing; the acoustic feature matrix only needs to be input into the 1st of the M sequentially connected acoustic feature enhancement modules, the M acoustic feature enhancement modules sequentially perform feature enhancement processing in the above manner, and the fusion feature output by the M-th acoustic feature enhancement module is used as the enhanced acoustic feature matrix.
In addition, the configuration of the convolution kernel size, the step size, and the padding size in the above third convolution layer and the deconvolution layer is not particularly limited in this embodiment, and may be configured by those skilled in the art according to actual needs.
It will be appreciated that, by introducing convolution processing and deconvolution processing into the enhancement of the acoustic features, this embodiment can more effectively enhance acoustic features presented in the form of feature maps, and finally extract features that are more relevant to pronunciation skills.
In an optional embodiment, the acoustic feature enhancement network further includes a second position encoding module and a fifth matrix fusion layer, and before inputting the acoustic feature matrix into the acoustic feature enhancement module for feature enhancement processing, the method further includes:
Inputting the acoustic feature matrix into a second position coding module for position coding processing to obtain a second position coding matrix;
inputting the second position coding matrix and the acoustic feature matrix into a fifth matrix fusion layer for matrix fusion processing to obtain an acoustic position fusion matrix;
Inputting the acoustic feature matrix into an acoustic feature enhancement module for feature enhancement processing to obtain an enhanced acoustic feature matrix, comprising:
And inputting the acoustic position fusion matrix into an acoustic feature enhancement module for feature enhancement processing to obtain an enhanced acoustic feature matrix.
In order to further improve the accuracy of pronunciation skill detection, in this embodiment, the original acoustic feature matrix is not input into the acoustic feature enhancement module for feature enhancement; instead, after position encoding, the acoustic feature matrix carrying the position information is input into the acoustic feature enhancement module for feature enhancement.
The electronic device inputs the acoustic feature matrix into a second position coding module to perform position coding processing to obtain a position coding matrix, and records the position coding matrix as a second position coding matrix, wherein the second position coding matrix characterizes the position information of each matrix unit in the acoustic feature matrix, and can be relative position information or absolute position information.
After the second position coding matrix is obtained, the electronic device inputs the second position coding matrix and the acoustic feature matrix into a fifth matrix fusion layer to perform matrix fusion processing, so as to obtain a fusion matrix, and the fusion matrix is recorded as an acoustic position fusion matrix. The electronic device then further inputs the acoustic position fusion matrix carrying the position information into the acoustic feature enhancement module to perform feature enhancement processing to obtain an enhanced acoustic feature matrix. For how the acoustic feature enhancement module performs feature enhancement, please refer to the related description of the above embodiments, which is not repeated here.
In addition, it should be noted that, in this embodiment, the matrix fusion manner of the fifth matrix fusion layer is not specifically limited, and may be configured by those skilled in the art according to actual needs. For example, the fifth matrix fusion layer is configured to perform addition processing on the two matrices inputted, and output the sum matrix obtained by addition as a fusion matrix.
In an alternative embodiment, referring to fig. 10, the feature fusion network includes a third matrix conversion layer, a fourth matrix conversion layer, a third multi-head attention layer, a sixth matrix fusion layer, a feedforward network layer, and a seventh matrix fusion layer, wherein,
The third matrix conversion layer is configured to perform matrix conversion processing on the input matrix to obtain a key matrix and a value matrix;
the fourth matrix conversion layer is configured to perform matrix conversion processing on the input matrix to obtain a query matrix;
The third multi-head attention layer is configured to perform attention enhancement processing on the key matrix, the value matrix and the query matrix output by the fourth matrix conversion layer to obtain an attention enhancement matrix;
The sixth matrix fusion layer is configured to perform matrix fusion processing on the attention enhancement matrix output by the third multi-head attention layer and the query matrix output by the fourth matrix conversion layer to obtain a fusion matrix;
the feedforward network layer is configured to perform feedforward calculation processing on the fusion matrix output by the sixth matrix fusion layer to obtain a feedforward matrix;
The seventh matrix fusion layer is configured to perform matrix fusion processing on the feedforward matrix output by the feedforward network layer and the fusion matrix output by the sixth matrix fusion layer to obtain a fusion feature matrix.
Accordingly, the electronic device may input the enhanced acoustic feature matrix and the phoneme feature matrix into the feature fusion network to perform feature fusion processing in the following manner:
inputting the enhanced acoustic feature matrix into a third matrix conversion layer for matrix conversion processing to obtain a key matrix and a value matrix, which are respectively marked as a third key matrix and a third value matrix;
Inputting the phoneme feature matrix into a fourth matrix conversion layer for matrix conversion processing to obtain a query matrix, and marking the query matrix as a third query matrix;
Inputting the third query matrix, the third key matrix and the third value matrix into a third multi-head attention layer for attention enhancement processing to obtain an attention enhancement matrix, and marking the attention enhancement matrix as a third attention enhancement matrix;
inputting the third attention enhancement matrix and the third query matrix into a sixth matrix fusion layer for matrix fusion processing to obtain a fusion matrix, and recording the fusion matrix as an acoustic phoneme fusion matrix;
Inputting the acoustic phoneme fusion matrix into a feedforward network layer for feedforward calculation processing to obtain a feedforward matrix;
And inputting the feedforward matrix and the acoustic phoneme fusion matrix into a seventh matrix fusion layer for matrix fusion processing to obtain a fusion matrix, and recording the fusion matrix as a fusion feature matrix.
The matrix fusion manner of the sixth matrix fusion layer and the seventh matrix fusion layer is not particularly limited in this embodiment, and may be configured by those skilled in the art according to actual needs.
For example, the sixth matrix fusion layer and the seventh matrix fusion layer have the same structure and respectively comprise two sub-layers, namely an addition layer and a layer normalization layer, when the matrices are fused, the addition layer adds the two input matrices to obtain a sum matrix, and then the layer normalization layer performs layer normalization processing on the sum matrix to obtain the fusion matrix.
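The feature fusion steps above amount to cross-attention: keys and values come from the acoustic side, the query from the phoneme side, followed by two addition-plus-layer-normalization fusions around a feedforward layer. A minimal single-head numpy sketch follows, with random weights and illustrative dimensions (none fixed by the embodiment).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def fuse(acoustic, phoneme, Wk, Wv, Wq, W1, b1, W2, b2):
    K, V = acoustic @ Wk, acoustic @ Wv        # third matrix conversion layer
    Q = phoneme @ Wq                           # fourth matrix conversion layer
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d)) @ V      # attention (one head shown)
    F = layer_norm(A + Q)                      # sixth matrix fusion layer
    ff = np.maximum(F @ W1 + b1, 0) @ W2 + b2  # feedforward network layer (ReLU assumed)
    return layer_norm(ff + F)                  # seventh matrix fusion layer

rng = np.random.default_rng(3)
d = 8
acoustic = rng.normal(size=(10, d))  # enhanced acoustic feature matrix (10 frames)
phoneme = rng.normal(size=(6, d))    # phoneme feature matrix (6 phonemes)
Wk, Wv, Wq = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
b1, b2 = np.zeros(4 * d), np.zeros(d)
out = fuse(acoustic, phoneme, Wk, Wv, Wq, W1, b1, W2, b2)
print(out.shape)
```

Note that the fusion feature matrix is aligned with the phoneme (query) side, even when the acoustic side has a different number of frames.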
In an alternative embodiment, referring to fig. 11, the first pronunciation skill detection network includes a first full-connection layer and a first classification function layer, and inputting the phoneme feature matrix into the first pronunciation skill detection network for pronunciation skill detection processing to obtain a first detection result includes:
Inputting the phoneme characteristic matrix into a first full-connection layer for full-connection processing to obtain a first full-connection result;
And inputting the first full-connection result into a first classification function layer for classification processing to obtain a first detection result.
It should be noted that, since the present embodiment is directed to performing pronunciation skill detection on multiple classes of pronunciation skills, any multi-classification function may be used for the first classification function layer.
Taking the Softmax function as an example, the dimension of the output vector of the Softmax function matches the number of pronunciation skills expected to be detected. For example, taking the English language as an example, if the pronunciation skills expected to be detected include continuous reading, loss of plosion and blushing, the output vector of the Softmax function includes 4 elements, wherein 1 element represents whether the text to be detected needs to be spoken with the pronunciation skill "continuous reading", 1 element represents whether the text to be detected needs to be spoken with the pronunciation skill "loss of plosion", 1 element represents whether the text to be detected needs to be spoken with the pronunciation skill "blushing", and 1 element represents that no pronunciation skill is needed to speak the text to be detected.
Correspondingly, the first full-connection result is input into the Softmax function to obtain a 4-dimensional output vector, which is taken as the first detection result. According to the first detection result, it can be determined whether the text to be detected needs to be spoken with a pronunciation skill and, when a pronunciation skill is required, which pronunciation skill specifically needs to be used to speak the text to be detected.
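This first detection head can be sketched as a fully connected projection followed by a Softmax over the 4 classes. The pooling of the phoneme feature matrix into a single vector, the random weights, and the label strings are illustrative assumptions, not fixed by the embodiment.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(4)
d = 8
pooled = rng.normal(size=d)       # pooled phoneme feature (pooling assumed)
W = rng.normal(size=(d, 4))       # first full-connection layer -> 4 classes
b = np.zeros(4)
probs = softmax(pooled @ W + b)   # first classification function layer
# Illustrative labels for the English example in the text.
labels = ["continuous reading", "loss of plosion", "blushing", "no skill needed"]
first_detection = labels[int(np.argmax(probs))]
print(first_detection)
```

The 4-dimensional `probs` vector plays the role of the first detection result; the argmax picks the class the model is most confident in.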
In an alternative embodiment, the second pronunciation skill detection network includes L branch detection networks, each branch detection network corresponds to a different pronunciation skill, L is an integer greater than 1, and the fused feature matrix is input into the second pronunciation skill detection network to perform pronunciation skill detection processing, so as to obtain a second detection result, where the method includes:
Inputting the fusion feature matrix into each branch detection network to detect pronunciation skills to obtain a branch pronunciation skill detection result of each branch detection network, wherein the branch pronunciation skill detection result of each branch detection network represents whether a speaker adopts pronunciation skills corresponding to each branch detection network to speak a text to be detected;
And obtaining a second detection result according to the branch pronunciation skill detection result of each branch detection network.
Each branch detection network corresponds to one pronunciation skill and is configured to detect whether the speaker speaks the text to be detected with its corresponding pronunciation skill; correspondingly, as many branch pronunciation skill detection results are obtained as there are branch detection networks.
For example, taking the English language as an example, when the pronunciation skills desired to be detected include continuous reading, loss of plosion and blushing, L takes the value 3, that is, the second pronunciation skill detection network includes 3 branch detection networks, where 1 branch detection network corresponds to the pronunciation skill "continuous reading", 1 branch detection network corresponds to the pronunciation skill "loss of plosion", and 1 branch detection network corresponds to the pronunciation skill "blushing". Correspondingly, the 3 branch detection networks each output 1 branch pronunciation skill detection result, and the 3 branch pronunciation skill detection results together form the second detection result. The second detection result thus characterizes whether the speaker speaks the text to be detected with a pronunciation skill and, if so, which pronunciation skill is specifically adopted.
It should be noted that, the structure of each branch detection network is the same, and a branch detection network is taken as an example for description, referring to fig. 12, the branch detection network includes a second full connection layer and a second classification function layer, the fusion feature matrix is input into each branch detection network to perform pronunciation skill detection, so as to obtain a branch pronunciation skill detection result of each branch detection network, which includes:
Inputting the fusion feature matrix into a second full-connection layer for full-connection processing to obtain a second full-connection result;
And inputting the second full-connection result into a second classification function layer for classification processing to obtain a branch pronunciation skill detection result.
It should be noted that the second classification function layer may employ any two-classification function.
Taking a sigmoid function as an example, the output value of the sigmoid function lies in [0,1], and after model training, the output of the sigmoid function can represent the probability that the speaker speaks the text to be detected with the pronunciation skill corresponding to the branch detection network in which it is located. For example, when the output value of the sigmoid function reaches a preset threshold (the threshold may be configured by those skilled in the art according to actual needs), it can be determined that the speaker speaks the text to be detected with the pronunciation skill corresponding to the branch detection network.
Correspondingly, the second full-connection result is input into the sigmoid function, the output value of the sigmoid function is obtained, and the output value is used as a branch pronunciation skill detection result. According to the branch pronunciation skill detection result, whether the speaker adopts the pronunciation skill corresponding to the branch detection network to speak the text to be detected can be determined.
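The branch detection step can be sketched as one fully connected projection plus a sigmoid per branch. The pooling of the fusion feature matrix into a vector, the random per-branch weights, the 0.5 threshold, and the skill labels are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(5)
d = 8
fused = rng.normal(size=d)    # pooled fusion feature (pooling assumed)
threshold = 0.5               # preset threshold, configurable per the text
second_detection = []
for skill in ["continuous reading", "loss of plosion", "blushing"]:
    W, b = rng.normal(size=d), 0.0  # second full-connection layer of this branch
    p = sigmoid(fused @ W + b)      # second classification function layer
    second_detection.append((skill, p >= threshold))
print(second_detection)
```

Each `(skill, flag)` pair is one branch pronunciation skill detection result; together the three pairs form the second detection result.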
In an alternative embodiment, the method further includes, before obtaining the text to be detected and converting the text to be detected into the corresponding phoneme sequence:
Acquiring a plurality of types of first sample texts which are known to require utterance with different pronunciation skills, and converting each type of first sample text into a corresponding positive sample phoneme sequence;
acquiring first sample audio obtained by a sample user speaking each type of first sample text with the corresponding pronunciation skill, and extracting positive sample acoustic features of the first sample audio of each type of first sample text;
acquiring a second sample text which is known to require no pronunciation skill to be spoken, and converting the second sample text into a corresponding negative sample phoneme sequence;
acquiring second sample audio of a second sample text spoken by a sample user, and extracting negative sample acoustic features of the second sample audio;
And performing model training according to each type of positive sample phoneme sequence, each type of positive sample acoustic feature, negative sample phoneme sequence and negative sample acoustic feature to obtain a pronunciation skill detection model.
In this embodiment, the acoustic feature samples and the phoneme sequence samples are not artificially constructed; instead, following a data-driven approach, the model is made to learn different pronunciation skills from a large amount of data. The following description takes a specific language as an example.
For this language, the electronic device obtains a plurality of types of first sample text, respectively, known to require utterances with different pronunciation skills. The number of the first sample text of each type of pronunciation skill acquired is not particularly limited herein, and may be configured by those skilled in the art according to actual needs.
For the first sample texts of each type of pronunciation skill, the electronic device converts each type of first sample text into a phoneme sequence, and records it as a positive sample phoneme sequence. For how to convert the first sample text into a phoneme sequence, the method of converting the text to be detected into a phoneme sequence in the above embodiment may be correspondingly implemented, which is not described herein.
The electronic device also obtains audio of each type of first sample text spoken by the sample user with different pronunciation skills, and marks the audio as first sample audio, and extracts acoustic features of the first sample audio of each type of first sample text, and marks the acoustic features as positive sample acoustic features. The sample user may be a real person with pronunciation skills or a virtual person with pronunciation skills, and accordingly, for how to obtain the first sample audio of each type of the first sample text and how to extract the acoustic features of the positive sample, the method of obtaining the audio to be detected and extracting the acoustic features of the audio to be detected in the above embodiment may be correspondingly implemented, which is not described herein.
The electronic device also obtains text known not to require pronunciation skills to speak, notes as a second sample text, and converts the second sample text into a corresponding phoneme sequence, notes as a negative sample phoneme sequence. For how to convert the second sample text into the phoneme sequence, the method of converting the text to be detected into the phoneme sequence in the above embodiment may be correspondingly implemented, which is not described herein.
The electronic device also obtains audio of the second sample text spoken by the sample user, noted as second sample audio, and extracts acoustic features of the second sample audio, noted as negative sample acoustic features. For how to obtain the second sample audio of the second sample text and how to extract the negative sample acoustic feature, the method of obtaining the audio to be detected and extracting the acoustic feature of the audio to be detected in the above embodiment may be correspondingly implemented, which is not described herein.
It should be noted that the number of sample users is not particularly limited in this embodiment, and may be configured by those skilled in the art according to actual needs, for example, the positive sample phoneme sequence, the positive sample acoustic feature, the negative sample phoneme sequence, and the negative sample acoustic feature are obtained by 500 sample users in this embodiment.
After the positive sample phoneme sequences, the positive sample acoustic features, the negative sample phoneme sequence and the negative sample acoustic features are obtained, the electronic device performs model training according to each type of positive sample phoneme sequence, each type of positive sample acoustic feature, the negative sample phoneme sequence and the negative sample acoustic feature until a preset stopping condition is met, so as to obtain the pronunciation skill detection model. The preset stopping condition may be configured such that the number of training iterations of the model reaches a preset number, or such that the model converges.
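The text does not specify the training objective, so the following is only a plausible sketch: a cross-entropy loss over the multi-class first detection head plus a binary cross-entropy loss per branch, computed here for one positive sample (labelled with the first skill) and one negative sample, with placeholder predictions standing in for model outputs.

```python
import numpy as np

def cross_entropy(probs, y):
    # Multi-class loss for the first detection head.
    return -np.log(probs[y] + 1e-12)

def binary_ce(p, y):
    # Two-class loss for each branch detection head.
    return -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

# Positive sample: skill class 0 required, branch 0 active.
probs_pos = np.array([0.7, 0.1, 0.1, 0.1])   # placeholder head output
branch_pos = np.array([0.8, 0.2, 0.1])       # placeholder branch outputs
loss_pos = cross_entropy(probs_pos, 0) + sum(
    binary_ce(p, y) for p, y in zip(branch_pos, [1, 0, 0]))

# Negative sample: "no skill needed" class 3, no branch active.
probs_neg = np.array([0.05, 0.05, 0.05, 0.85])
branch_neg = np.array([0.1, 0.2, 0.05])
loss_neg = cross_entropy(probs_neg, 3) + sum(
    binary_ce(p, y) for p, y in zip(branch_neg, [0, 0, 0]))
print(loss_pos, loss_neg)
```

In actual training, such per-sample losses would be averaged over batches and minimized by gradient descent until the preset stopping condition is met.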
Referring to fig. 13, in order to better execute the pronunciation skill detection method provided by the present application, the present application further provides a pronunciation skill detection device 400, as shown in fig. 13, the pronunciation skill detection device 400 includes:
the first obtaining module 410 is configured to obtain a text to be detected, and convert the text to be detected into a corresponding phoneme sequence;
the second obtaining module 420 is configured to obtain audio to be detected obtained by a speaker speaking the text to be detected, and extract acoustic features of the audio to be detected;
the detection module 430 is configured to input the phoneme sequence and the acoustic feature into a trained pronunciation skill detection model to perform pronunciation skill detection processing, so as to obtain a first detection result and a second detection result;
the first detection result is used for representing whether the text to be detected needs to be spoken with pronunciation skills, and the second detection result is used for representing whether the speaker actually adopted the pronunciation skills when speaking the text to be detected.
In an alternative embodiment, the pronunciation skill detection model includes a phoneme feature extraction network, an acoustic feature enhancement network, a feature fusion network, a first pronunciation skill detection network, and a second pronunciation skill detection network, and the detection module 430 is configured to:
Inputting the phoneme sequence into a phoneme feature extraction network to perform feature extraction processing to obtain a phoneme feature matrix;
Inputting the phoneme feature matrix into a first pronunciation skill detection network to carry out pronunciation skill detection processing to obtain a first detection result;
if the first detection result indicates that pronunciation skills are required to speak the text to be detected, inputting the acoustic features into an acoustic feature enhancement network for feature enhancement processing to obtain an enhanced acoustic feature matrix;
inputting the enhanced acoustic feature matrix and the phoneme feature matrix into a feature fusion network to perform feature fusion processing to obtain a fusion feature matrix;
and inputting the fusion feature matrix into a second pronunciation skill detection network to carry out pronunciation skill detection processing to obtain a second detection result.
In an alternative embodiment, the phoneme feature extraction network comprises a phoneme embedding module and a phoneme feature extraction module, and the detection module 430 is configured to:
inputting the phoneme sequence into a phoneme embedding module for embedding processing to obtain a phoneme vector matrix;
Inputting the phoneme vector matrix into a phoneme feature extraction module for feature extraction processing to obtain a phoneme feature matrix.
In an alternative embodiment, the phoneme feature extraction module comprises at least 1 phoneme feature extraction sub-module, and the detection module 430 is configured to:
when the number of the phoneme feature extraction submodules is 1, inputting the phoneme vector matrix into the phoneme feature extraction submodule to perform feature extraction processing to obtain a phoneme feature matrix, or
when the number of the phoneme feature extraction submodules is N, inputting the phoneme vector matrix into the N phoneme feature extraction submodules to sequentially perform feature extraction processing to obtain a phoneme feature matrix, where N is an integer greater than 1.
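The sequential application of the N submodules may be sketched as follows, purely for illustration; the submodules are arbitrary callables here, each consuming and producing a matrix of the same shape:

```python
def apply_submodules(submodules, x):
    """Pass the input through N submodules in sequence: the output of each
    submodule feeds the next, and the final output is the phoneme feature
    matrix. The callables stand in for the submodules of the embodiment."""
    for module in submodules:
        x = module(x)
    return x
```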
In an alternative embodiment, the phoneme feature extraction submodule includes a first matrix conversion layer, a first multi-head attention layer, and a first matrix fusion layer, and the detection module 430 is configured to:
Inputting the phoneme vector matrix into a first matrix conversion layer for matrix conversion processing to obtain a first query matrix, a first key matrix and a first value matrix;
Inputting a first query matrix, a first key matrix and a first value matrix into a first multi-head attention layer for attention enhancement processing to obtain a first attention enhancement matrix;
inputting the first attention enhancement matrix and the phoneme vector matrix into a first matrix fusion layer for matrix fusion processing to obtain a phoneme feature matrix.
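As an illustrative sketch only (the embodiment fixes neither dimensions nor weights), the matrix conversion, attention enhancement, and matrix fusion steps above may be expressed as follows, with a single attention head standing in for the multi-head layer and residual addition standing in for the fusion layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_block(X, Wq, Wk, Wv):
    """Project X into query/key/value (the 'matrix conversion layer'),
    apply scaled dot-product attention (the 'attention layer'), then fuse
    the attention-enhanced matrix with X by residual addition (the
    'matrix fusion layer'). Head splitting is omitted for brevity."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    A = softmax(scores) @ V   # first attention enhancement matrix
    return X + A              # residual fusion -> phoneme feature matrix
```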
In an alternative embodiment, the phoneme feature extraction network further includes a first position encoding module and a second matrix fusion layer, and before inputting the phoneme vector matrix into the phoneme feature extraction module for feature extraction processing, the detection module 430 is further configured to:
Inputting the phoneme vector matrix into a first position coding module for position coding processing to obtain a first position coding matrix;
Inputting the first position coding matrix and the phoneme vector matrix into a second matrix fusion layer for matrix fusion processing to obtain a phoneme position fusion matrix;
when the phoneme vector matrix is input to the phoneme feature extraction module to perform feature extraction processing to obtain a phoneme feature matrix, the detection module 430 is configured to input the phoneme position fusion matrix to the phoneme feature extraction module to perform feature extraction processing to obtain a phoneme feature matrix.
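The embodiment does not specify the position encoding scheme; one common choice, shown here purely as an illustrative sketch, is the sinusoidal encoding, with the matrix fusion realized as elementwise addition:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position encoding in the Transformer style; the patent
    fixes no scheme, so this is one plausible choice. Even feature indices
    carry sine terms, odd indices carry cosine terms."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def fuse_position(vec_matrix):
    """Second matrix fusion layer sketched as elementwise addition of the
    position encoding matrix and the phoneme vector matrix."""
    return vec_matrix + positional_encoding(*vec_matrix.shape)
```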
In an alternative embodiment, the acoustic feature enhancement network includes a feature encoding module and at least 1 acoustic feature enhancement module, and the detection module 430 is configured to:
inputting the acoustic features into the feature encoding module for feature encoding processing to obtain an acoustic feature matrix;
When the number of the acoustic feature enhancement modules is 1, inputting the acoustic feature matrix into the acoustic feature enhancement modules for feature enhancement processing to obtain an enhanced acoustic feature matrix, or
when the number of the acoustic feature enhancement modules is M, inputting the acoustic feature matrix into the M acoustic feature enhancement modules to sequentially perform feature enhancement processing to obtain an enhanced acoustic feature matrix, where M is an integer greater than 1.
In an alternative embodiment, the feature encoding module includes a first convolution layer, a first pooling layer, a second convolution layer, and a second pooling layer, and the detection module 430 is configured to:
Inputting the acoustic features into a first convolution layer for convolution processing to obtain a first convolution result;
inputting the first convolution result into a first pooling layer for pooling processing to obtain a first pooling result;
inputting the first pooling result into a second convolution layer for convolution processing to obtain a second convolution result;
and inputting the second convolution result into a second pooling layer for pooling processing to obtain an acoustic feature matrix.
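As an illustrative sketch only, the conv-pool-conv-pool structure of the feature encoding module may be expressed in one dimension as follows; the kernels and pooling size are illustrative and not taken from the embodiment:

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Valid-mode 1-D convolution (cross-correlation, as in deep-learning
    'conv' layers): slide the kernel over x with no padding."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

def max_pool(x, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

def encode(x, k1, k2):
    """Conv -> pool -> conv -> pool, mirroring the four layers of the
    feature encoding module described above."""
    return max_pool(conv1d_valid(max_pool(conv1d_valid(x, k1)), k2))
```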
In an alternative embodiment, the acoustic feature enhancement module includes a second matrix conversion layer, a second multi-head attention layer, a third matrix fusion layer, a third convolution layer, a deconvolution layer, and a fourth matrix fusion layer, and the detection module 430 is configured to:
Inputting the acoustic feature matrix into a second matrix conversion layer for matrix conversion processing to obtain a second query matrix, a second key matrix and a second value matrix;
Inputting the second query matrix, the second key matrix and the second value matrix into a second multi-head attention layer for attention enhancement processing to obtain a second attention enhancement matrix;
Inputting the second attention enhancement matrix and the acoustic feature matrix into a third matrix fusion layer for matrix fusion processing to obtain an acoustic fusion matrix;
Inputting the acoustic fusion matrix into a third convolution layer for convolution processing to obtain a third convolution result;
inputting the third convolution result into the deconvolution layer for deconvolution processing to obtain a deconvolution result;
and inputting the acoustic fusion matrix and the deconvolution result into a fourth matrix fusion layer for matrix fusion processing to obtain the enhanced acoustic feature matrix.
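The deconvolution layer above may be illustrated, in one dimension, as a transposed convolution; the stride and kernel are illustrative, not fixed by the embodiment:

```python
import numpy as np

def deconv1d(x, kernel, stride=1):
    """1-D transposed convolution ('deconvolution'): each input element
    scatters a scaled copy of the kernel into the (longer) output, which
    is how deconvolution restores the length reduced by convolution."""
    k = len(kernel)
    out = np.zeros(stride * (len(x) - 1) + k)
    for i, v in enumerate(x):
        out[i * stride:i * stride + k] += v * kernel
    return out
```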
In an alternative embodiment, the acoustic feature enhancement network further includes a second position encoding module and a fifth matrix fusion layer, and before the acoustic feature matrix is input to the acoustic feature enhancement module for feature enhancement processing, the detection module 430 is further configured to:
Inputting the acoustic feature matrix into a second position coding module for position coding processing to obtain a second position coding matrix;
inputting the second position coding matrix and the acoustic feature matrix into a fifth matrix fusion layer for matrix fusion processing to obtain an acoustic position fusion matrix;
When the acoustic feature matrix is input to the acoustic feature enhancement module for feature enhancement processing to obtain an enhanced acoustic feature matrix, the detection module 430 is configured to:
And inputting the acoustic position fusion matrix into an acoustic feature enhancement module for feature enhancement processing to obtain an enhanced acoustic feature matrix.
In an alternative embodiment, the feature fusion network includes a third matrix conversion layer, a fourth matrix conversion layer, a third multi-head attention layer, a sixth matrix fusion layer, a feed-forward network layer, and a seventh matrix fusion layer, and the detection module 430 is configured to:
inputting the enhanced acoustic feature matrix into a third matrix conversion layer for matrix conversion processing to obtain a third key matrix and a third value matrix;
inputting the phoneme feature matrix into a fourth matrix conversion layer for matrix conversion processing to obtain a third query matrix;
Inputting the third query matrix, the third key matrix and the third value matrix into a third multi-head attention layer for attention enhancement processing to obtain a third attention enhancement matrix;
inputting the third attention enhancement matrix and the third query matrix into a sixth matrix fusion layer for matrix fusion processing to obtain an acoustic phoneme fusion matrix;
inputting the acoustic phoneme fusion matrix into the feed-forward network layer for feed-forward calculation processing to obtain a feed-forward matrix;
and inputting the feed-forward matrix and the acoustic phoneme fusion matrix into a seventh matrix fusion layer for matrix fusion processing to obtain a fusion feature matrix.
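The fusion network above is a cross-attention pattern: queries come from the phoneme features, keys and values from the enhanced acoustic features. As an illustrative sketch only (single head, illustrative weights, residual addition standing in for the fusion layers):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(P, A, Wq, Wk, Wv, W1, W2):
    """Cross-attention fusion sketch: P is the phoneme feature matrix,
    A the enhanced acoustic feature matrix. The query is projected from P,
    keys/values from A; the attention output is fused with the query by
    residual addition (sixth fusion layer), then passed through a two-layer
    feed-forward block with a second residual addition (seventh layer)."""
    Q, K, V = P @ Wq, A @ Wk, A @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    H = Q + attn                          # sixth matrix fusion layer
    ff = np.maximum(H @ W1, 0.0) @ W2     # feed-forward with ReLU
    return H + ff                         # seventh matrix fusion layer
```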
In an alternative embodiment, the first pronunciation skill detection network includes a first full connection layer and a first classification function layer, and the detection module 430 is configured to:
Inputting the phoneme characteristic matrix into a first full-connection layer for full-connection processing to obtain a first full-connection result;
And inputting the first full-connection result into a first classification function layer for classification processing to obtain a first detection result.
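As an illustrative sketch only, the fully connected layer plus classification function layer may be expressed as a two-class softmax head; collapsing the phoneme feature matrix into a single vector and the class ordering are assumptions not stated in the embodiment:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def detect(feature_vec, W, b):
    """Full-connection processing (affine map) followed by a two-class
    softmax; returns True when the 'pronunciation skill required' class
    (assumed to be index 1) has the larger probability."""
    probs = softmax(feature_vec @ W + b)
    return bool(probs[1] > probs[0]), probs
```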
In an alternative embodiment, the second pronunciation skill detection network includes L branch detection networks, each branch detection network corresponding to a different pronunciation skill, L being an integer greater than 1, and the detection module 430 is configured to:
inputting the fusion feature matrix into each branch detection network for pronunciation skill detection to obtain a branch pronunciation skill detection result of each branch detection network, where the branch pronunciation skill detection result of each branch detection network represents whether the speaker adopted the pronunciation skill corresponding to that branch detection network when speaking the text to be detected;
And obtaining a second detection result according to the branch pronunciation skill detection result of each branch detection network.
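Obtaining the second detection result from the L branch results may be sketched as collecting the skills whose branches fired; the skill names below are purely illustrative and not taken from the embodiment:

```python
def combine_branches(branch_results, skill_names):
    """Each branch is an independent binary detector for one pronunciation
    skill; the second detection result is sketched here as the list of
    skills whose branch detection result is positive."""
    return [name for name, hit in zip(skill_names, branch_results) if hit]
```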
In an alternative embodiment, the branch detection network includes a second full connection layer and a second classification function layer, and the detection module 430 is configured to:
Inputting the fusion feature matrix into a second full-connection layer for full-connection processing to obtain a second full-connection result;
And inputting the second full-connection result into a second classification function layer for classification processing to obtain a branch pronunciation skill detection result.
In an alternative embodiment, the pronunciation skill detection device provided by the application further includes a training module configured to:
acquiring a plurality of types of first sample text known to require different pronunciation skills to speak, and converting each type of first sample text into a corresponding positive sample phoneme sequence;
acquiring first sample audio obtained by a sample user speaking each type of first sample text with the corresponding pronunciation skill, and extracting positive sample acoustic features from the first sample audio of each type of first sample text;
acquiring a second sample text known not to require pronunciation skills to speak, and converting the second sample text into a corresponding negative sample phoneme sequence;
acquiring second sample audio of a second sample text spoken by a sample user, and extracting negative sample acoustic features of the second sample audio;
And performing model training according to each type of positive sample phoneme sequence, each type of positive sample acoustic feature, negative sample phoneme sequence and negative sample acoustic feature to obtain a pronunciation skill detection model.
In an alternative embodiment, the second obtaining module 420 is configured to:
Extracting Filterbank features, fundamental frequency features and energy features of the audio to be detected;
and fusing the Filterbank characteristic, the fundamental frequency characteristic and the energy characteristic to obtain the acoustic characteristic.
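As an illustrative sketch only, the fusion of the Filterbank, fundamental frequency, and energy features may be realized as per-frame concatenation; the embodiment does not fix the fusion operator, and the log-energy computation shown is one common choice:

```python
import numpy as np

def frame_energy(frames):
    """Log energy per frame; frames is (num_frames, frame_len). The small
    constant guards against log(0) on silent frames."""
    return np.log((frames ** 2).sum(axis=1) + 1e-10)

def fuse_features(fbank, f0, energy):
    """Fuse per-frame Filterbank (T, D), fundamental-frequency (T,) and
    energy (T,) features by concatenation along the feature axis,
    yielding a (T, D+2) acoustic feature matrix."""
    return np.concatenate([fbank, f0[:, None], energy[:, None]], axis=1)
```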
In an alternative embodiment, the first obtaining module 410 is configured to:
removing text units that are not pronounced in the text to be detected to obtain a new text to be detected;
and converting each text unit in the new text to be detected into a corresponding phoneme unit to obtain a phoneme sequence.
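The two steps above may be sketched as a filter-then-lookup procedure; the set of silent units and the tiny lexicon below are illustrative only, not part of the embodiment:

```python
def text_to_phonemes(text, lexicon, silent_units):
    """Remove units that are not pronounced (e.g. punctuation), then map
    each remaining text unit to its phoneme units via a lexicon lookup,
    concatenating the results into a phoneme sequence."""
    units = [u for u in text if u not in silent_units]
    seq = []
    for u in units:
        seq.extend(lexicon.get(u, []))
    return seq
```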
It should be noted that, the pronunciation skill detecting device 400 provided in the embodiment of the present application and the pronunciation skill detecting method in the above embodiment belong to the same concept, and detailed implementation processes thereof are described in the above related embodiments, which are not repeated here.
The embodiment of the application also provides an electronic device, which comprises a memory and a processor, wherein the processor is used for executing the steps in the pronunciation skill detection method provided by the embodiment by calling the computer program stored in the memory.
Referring to fig. 14, fig. 14 is a schematic structural diagram of an electronic device 100 according to an embodiment of the application.
The electronic device 100 may include a network interface 110, a memory 120, a processor 130, a screen assembly, and the like. Those skilled in the art will appreciate that the structure of the electronic device 100 shown in fig. 14 does not constitute a limitation of the electronic device 100; it may include more or fewer components than shown, combine certain components, or arrange the components differently.
The network interface 110 may be used to make network connections between devices.
Memory 120 may be used to store computer programs and data. The memory 120 stores a computer program having executable code included therein. The computer program may be divided into various functional modules. The processor 130 executes various functional applications and data processing by running a computer program stored in the memory 120.
The processor 130 is a control center of the electronic device 100, connects various parts of the entire electronic device 100 using various interfaces and lines, and performs various functions of the electronic device 100 and processes data by running or executing computer programs stored in the memory 120 and calling data stored in the memory 120, thereby controlling the electronic device 100 as a whole.
In the embodiment of the present application, the processor 130 in the electronic device 100 loads executable code corresponding to one or more computer programs into the memory 120, and the processor 130 executes the steps in the pronunciation skill detection method provided by the present application, for example:
Obtaining a text to be detected, and converting the text to be detected into a corresponding phoneme sequence;
acquiring audio to be detected obtained by a speaker speaking the text to be detected, and extracting acoustic features of the audio to be detected;
inputting the phoneme sequence and the acoustic features into a trained pronunciation skill detection model for pronunciation skill detection processing to obtain a first detection result and a second detection result;
the first detection result is used for representing whether the text to be detected needs to be spoken with pronunciation skills, and the second detection result is used for representing whether the speaker actually adopted the pronunciation skills when speaking the text to be detected.
It should be noted that, the electronic device 100 provided in the embodiment of the present application and the method for detecting pronunciation skills in the above embodiment belong to the same concept, and detailed implementation processes of the method are described in the above related embodiments, which are not repeated here.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed on a processor of an electronic device provided by an embodiment of the present application, causes the processor of the electronic device to perform the steps in any of the pronunciation skill detection methods provided above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The method, apparatus, storage medium, and electronic device for pronunciation skill detection provided by the present application have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present application, and the above description of the embodiments is only intended to help understand the method and core ideas of the present application. Meanwhile, those skilled in the art may make changes to the specific embodiments and application scope according to the ideas of the present application; in view of the foregoing, the contents of this specification should not be construed as limiting the present application.